Open nleroy917 opened 1 year ago
@edward9065 is going to try to implement this. @nsheff he will need help with the algorithm if at all possible. Is there pseudocode anywhere? Or an algorithm figure?
Per discussion, much of the key code that needs to be ported is here: https://github.com/databio/uniwig/blob/master/src/uniwig.cpp
I noticed that uniwig relies on a C library, libBigWig, but there appears to be a Rust-based tool that is available (and in preprint!) that may help with this port:
https://github.com/jackh726/bigtools https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10871241/
Opened a PR to begin reviewing WIP.
Where this is currently 'stuck':
Example code:
let mut chrom_map = HashMap::new();
chrom_map.insert("chr17".to_string(), 83257441);
let vals_iter = BedParser::from_bed_file(file);
let vals = BedParserStreamingIterator::new(vals_iter, true);
let mut out = BigWigWrite::create_file(file_names[0].clone());
out.write(chrom_map, vals, runtime).unwrap();
The original code bins regions using smoothFixedStartEndBW before calling the libBigWig function bwAddIntervalSpanSteps to write to the bigWig file. I had hoped to replicate that here. However, I may need to create a new struct that implements the proper traits/Values so that the BigWigWrite functions can be used properly.
However, when attempting to write to a bigWig file after using the built-in BedParser, I get a type mismatch (Value vs BedEntry).
I really need to look into it, but this kind of sounds like an error in their library? Or should we implement the Write trait for the BedEntry structs? I'm probably not understanding fully, though.
See the return types for these two functions: https://github.com/jackh726/bigtools/blob/ccc884904f2c210e143f118dc2cf268e94723f13/bigtools/src/bed/bedparser.rs#L21-L80
Then, when you go to use the write function, it requires ChromData<Value=Values>:
https://github.com/jackh726/bigtools/blob/ccc884904f2c210e143f118dc2cf268e94723f13/bigtools/src/bbi/bigwigwrite.rs#L144-L147
Per discussion, we should rethink writing to a bigWig file, as we do not need to use these files in the genome browser. Instead, this implementation should focus on taking either a combined BED file or a directory of BED files and creating something similar to a wiggle file, i.e. do not worry about capturing the libBigWig functionality or attempting to implement items from bigtools. We should investigate using our own gtok file format or potentially a zarr format.
For inspiration of basic algorithm in Rust: https://github.com/databio/rustwig/blob/master/src/exact.rs
I've ported the core functionality from the above rustwig repository. genimtools::uniwig can now count starts and/or ends if given a single, sorted BED file.
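For anyone following along, here is a minimal sketch of what "counting starts" with smoothing can look like. It is not the rustwig or uniwig code. The function name `smoothed_start_counts`, the inclusive window `[s - smooth_size, s + smooth_size]`, and the 0-based positions are all assumptions made for illustration.

```rust
/// Hypothetical helper (not part of genimtools or rustwig).
/// For one chromosome, count how many smoothed start sites cover each
/// 0-based position. Each start `s` is widened to the inclusive window
/// `[s - smooth_size, s + smooth_size]` and clamped to the chromosome.
pub fn smoothed_start_counts(starts: &[u32], chrom_size: u32, smooth_size: u32) -> Vec<u32> {
    let n = chrom_size as usize;
    // Difference array: +1 where a window opens, -1 just past where it closes.
    let mut diff = vec![0i64; n + 1];
    for &s in starts {
        let lo = s.saturating_sub(smooth_size).min(chrom_size) as usize;
        let hi = s
            .saturating_add(smooth_size)
            .saturating_add(1)
            .min(chrom_size) as usize;
        // Windows that fall entirely off the chromosome are skipped.
        if lo < hi {
            diff[lo] += 1;
            diff[hi] -= 1;
        }
    }
    // A running prefix sum turns the difference array into per-base counts.
    let mut counts = Vec::with_capacity(n);
    let mut running = 0i64;
    for d in diff.iter().take(n) {
        running += *d;
        counts.push(running as u32);
    }
    counts
}
```

With `smooth_size = 0` this reduces to an exact count of starts per position. The same function could be run on end coordinates to count ends.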
We should determine what output file we want. I believe this to be higher priority before proceeding with covering the other gaps (sorted vs unsorted, reading a list of beds instead of a single, etc).
We've discussed implementing zarr, though I haven't yet looked at the various Rust implementations to check their maturity. https://zarr.dev/implementations/
This project had a release as recently as March 2024: https://github.com/LDeakin/zarrs However, their github page warns that the repository is not production ready.
A simple, short-term option could be to make the output BED-like, similar to a bedGraph file, e.g.
file1.unibed
chromA chromStartA countValue
chromA chromStartB countValue
Open to other suggestions, especially if there is already an existing file format that makes more sense.
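To make the BED-like idea concrete, a rough sketch of a writer for the three-column layout above. The name `write_unibed`, the tab separator, and the choice to skip zero counts are assumptions, not a settled format.

```rust
use std::io::{self, Write};

/// Hypothetical writer for the proposed BED-like output.
/// Emits one `chrom<TAB>start<TAB>count` line per 0-based position,
/// skipping positions whose count is zero (an assumption, to keep files small).
pub fn write_unibed<W: Write>(out: &mut W, chrom: &str, counts: &[u32]) -> io::Result<()> {
    for (pos, &count) in counts.iter().enumerate() {
        if count > 0 {
            writeln!(out, "{}\t{}\t{}", chrom, pos, count)?;
        }
    }
    Ok(())
}
```

Because the writer is generic over `Write`, the same function could target a `File`, a `BufWriter`, or an in-memory `Vec<u8>` in tests.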
This is great! Are you ready for a review of the code?
Not quite yet. Earlier today, we discussed potentially just converting these to wig files in the short term, so I'll look into writing these arrays to some file type first and then, as a first pass, this would be ready to merge into dev.
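As a sketch of the wig route, assuming the counts are a dense per-base array for one chromosome: the fixedStep wiggle format takes a declaration line followed by one value per line, with 1-based positions. The function name `write_fixed_step_wig` and the way `step_size` subsamples the array are assumptions here, not what uniwig does.

```rust
use std::io::{self, Write};

/// Hypothetical writer that dumps a dense count array as a fixedStep wig track.
/// Index 0 of `counts` is written as position 1, since wig coordinates are 1-based.
pub fn write_fixed_step_wig<W: Write>(
    out: &mut W,
    chrom: &str,
    counts: &[u32],
    step_size: usize,
) -> io::Result<()> {
    // A step of 0 would never advance, so treat it as 1.
    let step = step_size.max(1);
    writeln!(out, "fixedStep chrom={} start=1 step={}", chrom, step)?;
    // Keep every `step`-th value, starting from the first position.
    for value in counts.iter().step_by(step) {
        writeln!(out, "{}", value)?;
    }
    Ok(())
}
```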
Gotcha 👍🏼 Just lmk
This still needs python bindings before it can properly be closed.
What do we want the ergonomics to be like here?
from gtars.uniwig import uniwig
uniwig(
combined_bed="/path/to/file.bed",
smooth_size=10,
step_size=5,
...
)
Could it be that simple? Does it need a python interface then?
Once we release the newest uniwig changes to master branch, we can close this issue and create new ones that are more specific to future enhancements.
To enable universe creation, we will need to port uniwig over to this package and offer it up as a cli, a library crate interface, and ideally a python interface.