Open MatthewRalston opened 8 months ago
What is needed for working data structure initialization? Why isn't it working?
- Walk files
- Path files
- Tree files

Contains:
- walks from/to "central/incidental" nodes
- Forward walk
- Reverse walk
  - Forward schema
  - Reverse schema
Walk files are just like path files: they primarily contain an ordering of edges. All walks are paths, but a walk may have a forward and a reverse direction, so all walks and their originating context (a.k.a. a .kdbg file) must be either minimal or expanded. A minimal walk carries all edges and a positioning id (i) only, plus a "retrospective" bool, a "solutional" bool (set if the walk is said to be solutional from an assembly process associated with .kdbg version 1.0.0 or greater), a version number associated with the kmerdb release, and the sha256 of the git release (on each edge, yes). An expanded walk adds the retrospective/prospective flags and the previously investigated forks with their node IDs.
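To make the minimal form concrete, here is a sketch of what a per-edge walk record could look like, assuming exactly the fields listed above; the class and field names are illustrative, not the kmerdb API.

```python
# Hypothetical per-edge record for a minimal walk file; names are assumptions
# drawn from the description above, not a finalized kmerdb schema.
from dataclasses import dataclass

@dataclass
class WalkEdgeRecord:
    i: int               # positioning id within the walk
    edge_id: int         # the edge from the originating .kdbg context
    retrospective: bool  # True if the edge belongs to the original context
    solutional: bool     # True if produced by an assembly process (.kdbg >= 1.0.0)
    kmerdb_version: str  # kmerdb release that produced the walk
    git_sha256: str      # sha256 of the git release, carried on each edge
```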
[ minimal walks ]
A minimal walk file must also include all edges of the original context (a.k.a. all edges observed from the dataset(s) in the .kdbg header), marked with the retrospective bool, along with one or more copies of the same edge with prospective bool = True when representing a specific walk (as opposed to a minimal path: a single linear representation of edges, a sort order with no presumed origin id). See the sketch below.
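As an illustration of that marking scheme, a minimal sketch follows; the function name, field names, and row layout are hypothetical, not the kmerdb on-disk format.

```python
# Hypothetical builder for the edge rows of a minimal walk file: every edge of
# the .kdbg context is emitted with retrospective=True, and each step of the
# specific walk is emitted again with prospective=True and its sort order i.
def minimal_walk_rows(context_edges, walk_edge_ids):
    rows = []
    for edge_id in context_edges:
        # edges observed in the originating context, no walk position
        rows.append({"edge_id": edge_id, "retrospective": True,
                     "prospective": False, "i": None})
    for i, edge_id in enumerate(walk_edge_ids):
        # one copy per step of the specific walk, with its position
        rows.append({"edge_id": edge_id, "retrospective": False,
                     "prospective": True, "i": i})
    return rows
```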
Related issues: #126 #122 #125 #102 #124
@MatthewRalston thinks the path forward towards a graph format is in creating additional structural definitions. Thinking through the relationships preserved among the different incomplete and completely self-referential formats, they require associated metadata schemas, and a utility function that takes a table or metadata-schema input and generates a consistently hashable representation (the metadata header format, its parser, and the table-parsing functionality, as in these modules)...
- kmerdb.graph
- kmerdb.fileutil
- kmerdb.parse

...and their references, i.e. "the format(s)", and the associated schemas.
This utility function wouldn't be part of the algorithm per se, but it would be incident to what is produced by the file-metadata log (and this version-dataset pairing). That's mostly contained in our __init__ and the associated module files for format access, plus whatever value future versions provide through new features and solutions. Tying that to a git sha256 hash should be preserved with all nodes of a given walk or path; a sketch follows.
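A minimal sketch of such a hashing utility, assuming the metadata header is a plain dict: canonical JSON plus sha256 is one common way to get a consistent digest, though not necessarily what kmerdb will adopt.

```python
# Hedged sketch: deterministic digest of a metadata header. sort_keys and
# fixed separators make the serialization canonical, so identical headers
# always hash to the same value.
import hashlib
import json

def metadata_digest(header: dict) -> str:
    canonical = json.dumps(header, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Illustrative values only: pairing a version with dataset metadata.
digest = metadata_digest({"version": "0.0.0", "files": ["example.fa.gz"], "k": 12})
```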
This issue has been tabled for the time being in favor of a cleaner UI and experience on the user end.
- I want the user to understand the output, including the ASCII styling (in the absence of a rich dependency, which isn't needed).
- I want the logfile and output directories (required to collect .kdb, .kdbg, .stats.txt, output.log, etc.).
- I want the expanded help and usage statements, including the 'features' and 'steps', developed further.
- I want STDOUT to be extremely minimal and/or non-existent in the profile and graph commands, OR the formatting should display the resulting stats clearly apart from the header.
- And finally, a README overhaul.
Okay, I've been working on some other features and the needed documentation/UI overhauls. Delays pushed the deadline back a few months, so I'm reprioritizing the assembly algorithm and possible numba/Python implementations of the D2 metrics, with more odds-ratio work on the horizon, more literature review, and the beginnings of a report and lit review on applications of k-mer count matrices and distances to metagenomics and microbiomes.
Key Question

[[ walk file ]]: schema concepts, for format versions of course...
[ minimal walks ]
[[ solutional path file ]]: the solutional path
Header metadata will have the source and the parameters in the header, plus a walk id (a sha256 of the walk) for the associated walk file and a walk name (given at "runtime" via the CLI). The walk id may be 0 to represent an unspecified or unqualified walk (origin unclear). A sketch follows.
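For illustration only, a sketch of such a header as a Python dict; every field name and value here is hypothetical, not a finalized kmerdb schema.

```python
# Hypothetical header metadata for a solutional path file.
header = {
    "source": "example.kdbg",      # originating .kdbg context
    "parameters": {"k": 12},       # parameters used to produce the walk
    "walk_id": 0,                  # sha256 of the walk, or 0 if unqualified
    "walk_name": "my_first_walk",  # supplied at runtime via the CLI
}
```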
sidenote
The neighbor structure 🌪️ is manifested by particular k-mer IDs 🌬️, which may be accessed from the k-mer arrays loaded alongside the edge list during a path-producing process; see the sketch below.
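A hedged sketch of the neighbor rule, assuming the common 2-bit k-mer encoding (A=0, C=1, G=2, T=3); this is the generic de Bruijn-style neighbor computation, not necessarily the internals of kmerdb.graph. Candidate ids produced this way could then be checked against the edge list loaded from the .kdbg file.

```python
# Hedged sketch: derive candidate neighbor k-mer ids from a k-mer id alone,
# under a 2-bit-per-base encoding with the first base in the high bits.
def right_neighbors(kmer_id: int, k: int):
    # drop the leftmost base (mask) and append each possible new base
    mask = (1 << (2 * k)) - 1
    return [((kmer_id << 2) | base) & mask for base in range(4)]

def left_neighbors(kmer_id: int, k: int):
    # drop the rightmost base and prepend each possible new base
    return [(kmer_id >> 2) | (base << (2 * (k - 1))) for base in range(4)]
```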
A working pipeline would carry all components of the workflow forward onto the next step, but all commands are currently partial. Schemas are in the planning stage for a future release.