MatthewRalston / kmerdb

Python bioinformatics CLI for k-mer counts and de Bruijn graphs
https://matthewralston.github.io/kmerdb
Apache License 2.0

Expanded row metadata for graph format #130

Open MatthewRalston opened 5 months ago

MatthewRalston commented 5 months ago

Key Question

What is needed for working data structure initialization? Why isn't it working?

The node and edge lists, the prioritization or sort strategy for edge representation, weights, multigraph and combination representation, orientation of edges, dual strandedness, and the .kdbg row metadata fields (non-int, but Boolean, i.e. for fast lookup) are not yet finalized.

[[ walk file ]]

Walk files are just like path files, and primarily contain an ordering of edges. All walks are paths, but a walk may have a forward and a reverse direction, so every walk and its originating context (aka a .kdbg file) must be either minimal or expanded. A minimal walk carries all edges and a positioning id (i) only, plus a "retrospective" bool, a "solutional" bool (set if the walk is said to be solutional from an assembly process associated with .kdbg version 1.0.1 or greater), a version number associated with the kmerdb release, and the sha256 of the git release (on each edge, yes). An expanded walk adds retrospective, prospective, previous forking nodes, and previous walks to investigate, along with their node IDs.
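To make the minimal row concrete, here is a rough sketch of what one record could look like. Nothing here is finalized: the class name, the field names, the tab-separated layout, and the placeholder version string are all assumptions made for illustration, not the actual .kdbg or walk format.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MinimalWalkRow:
    """One row of a hypothetical minimal walk file (illustrative field names)."""
    i: int                # positioning id of the edge within the walk
    src_kmer_id: int      # k-mer id of the node the edge leaves
    dst_kmer_id: int      # k-mer id of the node the edge enters
    retrospective: bool   # edge was observed in the originating .kdbg context
    solutional: bool      # walk is said to come from an assembly process
    kmerdb_version: str   # kmerdb release that wrote the row
    release_sha256: str   # sha256 of the git release, repeated on each edge

    def to_tsv(self) -> str:
        """Serialize the row as one tab-separated line, with bools as 0/1."""
        values = []
        for f in fields(self):
            value = getattr(self, f.name)
            values.append(str(int(value)) if isinstance(value, bool) else str(value))
        return "\t".join(values)


# Placeholder values only; "x.y.z" and the all-zero digest are not real.
row = MinimalWalkRow(0, 6, 24, True, False, "x.y.z", "0" * 64)
print(row.to_tsv())
```

Storing the Booleans as 0/1 keeps the row cheap to parse, which is the "fast lookup" motivation mentioned above.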

schema concepts

for format versions of course...

Should be self-referential and contain nodes, edges, and walks and/or paths. Metadata includes relevant references to schema versioning, and specific file references for interpretation.

[ minimal walks ]

A minimal walk file must also include all edges of the original context (a.k.a. all edges observed from the dataset(s) in the .kdbg header), each marked with a retrospective bool, along with one or more copies of the same edge with prospective bool = True when representing a specific walk (not a minimal path, a single linear representation of edges, a sort order with no presumed source reference).
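A sketch of how those rows could be assembled, assuming edges are plain (source id, destination id) tuples. The function name, the dict keys, and the choice to leave `i` unset on context rows are hypothetical, and setting retrospective to False on the walk copies is my reading of the description rather than a settled rule.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int]  # (source k-mer id, destination k-mer id)


def build_minimal_walk_rows(context_edges: List[Edge], walk: List[Edge]) -> List[Dict]:
    """Emit every context edge once, then one prospective copy per walk step."""
    known = set(context_edges)
    rows: List[Dict] = []
    # All edges observed in the originating context, marked retrospective.
    for edge in context_edges:
        rows.append({"i": None, "edge": edge,
                     "retrospective": True, "prospective": False})
    # The walk itself: an edge may be repeated, so it may be copied many times.
    for i, edge in enumerate(walk):
        if edge not in known:
            raise ValueError(f"walk edge {edge} is not in the original context")
        rows.append({"i": i, "edge": edge,
                     "retrospective": False, "prospective": True})
    return rows


context = [(0, 1), (1, 2), (2, 0)]
walk = [(0, 1), (1, 2), (2, 0), (0, 1)]  # revisits the first edge
rows = build_minimal_walk_rows(context, walk)
```

Rejecting walk edges that are missing from the context keeps the file self-referential: every prospective row can be traced back to a retrospective one.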

solutional path

a walk, along with all previous walks (in chronological, aka integer id, order, by reference), along with the sha256sum of the git release that produced the walk, the metadata, etc...

[[ solutional path file ]]

The header metadata will have the source and the parameters, plus a walk id (a sha256 of the walk) for an associated walk file, and a walk name (given at "runtime" via the CLI). May be 0 to represent an unspecific or unqualified walk (origin unclear).
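A possible shape for that header, as a sketch only. The helper names, the dict keys, and the line-per-edge serialization fed to the hash are assumptions; I am also assuming that it is the walk id that falls back to 0 when no walk is supplied.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

Edge = Tuple[int, int]


def walk_sha256(walk: List[Edge]) -> str:
    """Hash the ordered edge list so the same walk always gets the same id."""
    digest = hashlib.sha256()
    for src, dst in walk:
        digest.update(f"{src}\t{dst}\n".encode("utf-8"))
    return digest.hexdigest()


def solutional_path_header(source: str, parameters: Dict,
                           walk: Optional[List[Edge]] = None,
                           walk_name: str = "") -> Dict:
    """Build hypothetical header metadata; walk_id is 0 for an unqualified walk."""
    return {
        "source": source,
        "parameters": parameters,
        "walk_id": walk_sha256(walk) if walk else 0,
        "walk_name": walk_name,
    }


header = solutional_path_header("example.kdbg", {"k": 12},
                                walk=[(0, 1), (1, 2)], walk_name="demo")
unqualified = solutional_path_header("example.kdbg", {"k": 12})
```

Because the hash is fed edges in order, reversing a walk produces a different id, which matters if forward and reverse walks are tracked separately.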

Related issues

#126 #122 #125 #102 #124

sidenote

The neighbor structure 🌪️is manifested by particular kmer IDs🌬️, which may be accessed from kmer arrays loaded alongside the edge list during a path producing process.
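As an illustration of how neighbors can fall out of k-mer IDs alone, here is a sketch that assumes a 2-bit encoding (A=0, C=1, G=2, T=3) with the first base in the highest bits. The function names are hypothetical and this is not a claim about how kmerdb computes neighbors internally.

```python
from typing import List


def right_neighbors(kmer_id: int, k: int) -> List[int]:
    """IDs of the four k-mers whose (k-1)-prefix equals this k-mer's suffix."""
    mask = (1 << (2 * k)) - 1          # keep only the lowest 2k bits
    shifted = (kmer_id << 2) & mask    # drop the first base, open a slot at the end
    return [shifted | base for base in range(4)]


def left_neighbors(kmer_id: int, k: int) -> List[int]:
    """IDs of the four k-mers whose (k-1)-suffix equals this k-mer's prefix."""
    prefix = kmer_id >> 2              # drop the last base
    return [(base << (2 * (k - 1))) | prefix for base in range(4)]


# With k=3, "ACG" encodes to 0*16 + 1*4 + 2 = 6.
print(right_neighbors(6, 3))  # CGA, CGC, CGG, CGT
print(left_neighbors(6, 3))   # AAC, CAC, GAC, TAC
```

Under this encoding the neighbor IDs can index directly into a k-mer count array, so an edge list and the arrays loaded alongside it are enough to walk the graph.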

A working pipeline would carry all components of the workflow on to the next step, but all commands are currently partial. Schemas are in the planning stage for a future release.

MatthewRalston commented 5 months ago

Key Question

What is needed for working data structure initialization? Why isn't it working?

Node files

No comment

Edge files

Not applicable

types of walks

[[ node schema (in progress) ]]

[[ Edge schema ]]

[[ Walk schema (in progress) ]]

[[ The walk file ]]

Walk files are just like path files, and primarily contain an ordering of edges. All walks are paths, but a walk may have a forward and a reverse direction, so every walk and its originating context (aka a .kdbg file) must be either minimal or expanded. A minimal walk carries all edges and a positioning id (i) only, plus a "retrospective" bool, a "solutional" bool (set if the walk is said to be solutional from an assembly process associated with .kdbg version 1.0.0 or greater), a version number associated with the kmerdb release, and the sha256 of the git release (on each edge, yes). An expanded walk adds retrospective, prospective, and previous forks to investigate, along with their node IDs.

minimal walks

A minimal walk file must also include all edges of the original context (a.k.a. all edges observed from the dataset(s) in the .kdbg header), each marked with a retrospective bool, along with one or more copies of the same edge with prospective bool = True when representing a specific walk (not a minimal path, a single linear representation of edges, a sort order with no presumed origin id).

Related issues

Issues #126 #122 #125 #102 #124

@MatthewRalston thinks the path forward towards a graph format is in creating additional structural definitions. If I think through the relationships preserved among different incomplete and completely self-referential formats, they require associated metadata schemas, and the utility function of taking a table or metadata schematic input and generating a consistently hashable representation (the metadata header format, its parser, and the table parsing functionality, as in these modules)...

i.e. "the format(s)"

And associated schemas...

This utility function wouldn't be part of the algorithm per se, but it would be incident to that which is produced by virtue of the file-metadata-log (and this version-dataset pairing) thingawhosit. That's mostly contained in our __init__, and associated module files for format access and associated value provided from features and solutions in future versions.

And that, tied to a git sha256 hash, should be preserved with all nodes of a given walk or path.
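One way to get a consistently hashable representation is to canonicalize the metadata before hashing. The sketch below uses JSON as a stand-in serialization; the function name and the example keys are hypothetical, and the real header format and its parser may differ.

```python
import hashlib
import json
from typing import Dict


def metadata_digest(metadata: Dict) -> str:
    """sha256 of a canonical JSON form: sorted keys, fixed separators, ASCII only."""
    canonical = json.dumps(metadata, sort_keys=True,
                           separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The same metadata in a different key order hashes identically.
a = {"version": "x.y.z", "k": 12, "files": ["example.fa"]}
b = {"files": ["example.fa"], "k": 12, "version": "x.y.z"}
print(metadata_digest(a) == metadata_digest(b))
```

Sorting keys and fixing separators removes the incidental differences (key order, whitespace) that would otherwise change the hash, so the digest only moves when the metadata itself does.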

MatthewRalston commented 5 months ago

This issue has been tabled for the time being in favor of a cleaner UI and experience on the user end.

1. Interface overhaul (issue #132)

I want the user to understand the output, and even ASCII styling (in the absence of a rich.py dependency, which isn't needed).

output_dir

I want the logfile and output directories (required to collect .kdb, .kdbg, .stats.txt, output.log etc)
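Roughly what I have in mind for collecting outputs in one place, as a sketch. The function name, the `prefix` parameter, and the dict keys are placeholders, not existing kmerdb options.

```python
from pathlib import Path
from typing import Dict


def output_paths(output_dir: str, prefix: str) -> Dict[str, Path]:
    """Create the output directory and return where each artifact would go."""
    root = Path(output_dir)
    root.mkdir(parents=True, exist_ok=True)  # safe to call on an existing directory
    return {
        "kdb": root / f"{prefix}.kdb",
        "kdbg": root / f"{prefix}.kdbg",
        "stats": root / f"{prefix}.stats.txt",
        "log": root / "output.log",
    }
```

Resolving every path up front means each command can write to a known location and the logfile always lands next to the results.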

usage, steps, and features

I want the expanded help and usage statements, including the 'features' and 'steps' developed further.

minimal STDOUT

And finally, I want the STDOUT to be extremely minimal and/or non-existent, in the profile and graph commands. OR the formatting should display the resulting stats clearly apart from the header.

README "2.0" (issue #137)

Finally, readme overhaul

MatthewRalston commented 1 month ago

Okay, I've been working on some other features and needed documentation/UI overhauls. Delays pushed the deadline back a few months. I'm reprioritizing the assembly algorithm and possible numba/Python etc. implementations of D2 metrics. More odds-ratio stuff is on the horizon, along with more literature review, and I'm beginning to write a report and lit review on applications of k-mer count matrices and distances to metagenomics and microbiomes.