MatthewRalston / kmerdb

Python bioinformatics CLI for k-mer counts and de Bruijn graphs
https://matthewralston.github.io/kmerdb
Apache License 2.0
12 stars 1 forks source link

Graph algorithms #122

Open MatthewRalston opened 4 months ago

MatthewRalston commented 4 months ago

Here, the kmerdb project will be pivoting after the 0.7.6 release to use a modified .kdb format and no backwards compatibility is explicitly planned.

The goal of the refactor/pivot is to introduce networkx and/or cugraph to the possible toolkits used to facilitate the implementation of an assembly algorithm AND/OR a .kdbg format specification for exact .fasta assembly or approximate 'Eulerian' walk (.fastq) through the rows specified in the "Assembly algorithm prototype" Github milestone.

MatthewRalston commented 4 months ago

Today, progress was made on generating the graph for the Eulerian walk. Metadata format/schema largely remains the same, and so far the main schema consists of three col. N1 n2 and w. The w is specified for the Eulerian path, however that might be implemented.

Started reading the wiki article on plessy v Ferguson. I know Americans on the left love to shout from the rooftops that kumbaya for all is here. It's '24 after all. And yet, 50% of the populace was not dissuaded by the use of flagrantly racist language and rhetoric by the ex President Donnie grump (voldy 2.0)

And what's more startling I guess is the disparities between celebrities on television, and the reality of many as and other minority and also white persons in different housing districts than the elite. whoa but just on the deep hip hop dose on the st life thing is making my head ache. if you know you know. people get ignorant about waste and ignorant about love. tytgs.

I was fired for missing one ducking email and bc I wrestle about forgiving you for not defending me on that issue. You're privileged and I got dumped. Steve and Deb, you're no different. Doesn't matter what you thought now or then, I stuck up with your group when you needed an extra head. I got the work done.

Go to hell.

That's what I think of the goddak establishment.

Here I am door dashing and begging my parents to cut my interest rates so I can afford to eat. You pos won't ever understand that.and I hope you never have to. But miss me with that kumbaya ish rn.

on the upside, fuggin hate my brand but love the game. different directions both re self study, metrics, profiling, and graphics. still need a more concrete problem to make the feature on the algorithm biorxiv. that's what's got me stuck in loops re money.

that would be the real assembly algo and the future goal, but we might only have time for a networkx cpu strategy and then a cugraph assembler could leverage the indexing structure .kdb.gi to produce tuples rapidly to python to transfer to the gpu for a cugraph graph traversal after trimming. the networkx assembler would leverage the same thing. this is essentialy milestone 2

because id rather take the right whip out into the country and gather field samples, than to get stuck in a wfh situation churning my money on finding better digital samples when i'd rather do something combining field, wet bench, and then maybe some fastq exploration with maybe a model of the graph and the best case scenarios re: known genome (eco, bsub, cdiff, etc) full assembly (n50, ng50, contig count, orf count/gene count, pfam stats, other orthology/paralogy metrics, contig diversity, read diversity), and/or approximate Eulerian walk (after edge and node trimming strategies, followed by like.... idk yet.)

MatthewRalston commented 4 months ago

Is this where I migrate plans from issue to milestone and/or documentation by modifying obj in comments and then official planning checkmarks?

NEXT:

First block could be '\n' deltimited rows. count vec (n1) and index (n2==n1) [n2 is the 4**k dimensional 1-tuple/vector) Second block could be delimited with uWu. edge 2-tuple vector/array (n1) and weight (n2) np.array [n1 is the number of possible edges (WOOF), n2 driver variable may be --sparse or (default: --)inclusive. Inclusive makes the full matrix (don't want) in flattened form and then compressed. Sparse storage may make the n2 more reasonable, but the adjacency structure in unstructured form may make indexing worse. If the algorithm for accessing index rapidly is written in Python (or Cython), then accessing the index table(.kdbgi) should be trivial. If this feature is developed more, an in-memory solver may be next. If the --sparse option is developed. Third block also closed with uwu.

Just remove the edges where weight=0

Human readability

MatthewRalston commented 3 months ago

122 #123 #124 #125 #126

Neighbor construction working out well. A dictionary of dictionaries is being used to focus on local "neighbor" space only: i.e. the 8 adjacent kmers to any id.

This has been spun off in a utility function in kmer.py.

Adding some more documentation to the kmerdb.graph module as it prototypes most of what is required to write and read (some validations) .kdbg files.

Edge list and data structure still in planning.