databio / gtars

Performance-critical tools to manipulate, analyze, and process genomic interval data. Primarily focused on building tools for geniml - our genomic machine learning python package.

Implement `uniwig` #1

Open nleroy917 opened 8 months ago

nleroy917 commented 8 months ago

To enable universe creation, we need to port uniwig to this package and expose it as a CLI, a library crate interface, and ideally a Python interface.

nleroy917 commented 8 months ago

@edward9065 is going to try to implement this. @nsheff he will need help with the algorithm if at all possible. Is there pseudocode anywhere? Or an algorithm figure?

donaldcampbelljr commented 4 months ago

Per discussion, much of the key code that needs to be ported is here: https://github.com/databio/uniwig/blob/master/src/uniwig.cpp

I noticed that uniwig relies on a C library, libBigWig, but there appears to be a Rust-based tool that is available (and in preprint!) that may help with this port:

https://github.com/jackh726/bigtools https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10871241/

donaldcampbelljr commented 4 months ago

Opened a PR to begin reviewing WIP.

Where this is currently 'stuck':

Example code:

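        // Map chromosome names to their lengths in bp.
        let mut chrom_map = HashMap::new();
        chrom_map.insert("chr17".to_string(), 83257441);

        // Stream entries from the BED file using the built-in parser.
        let vals_iter = BedParser::from_bed_file(file);
        let vals = BedParserStreamingIterator::new(vals_iter, true);

        let mut out = BigWigWrite::create_file(file_names[0].clone());

        // This is the call that is currently stuck.
        out.write(chrom_map, vals, runtime).unwrap();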

The original code bins regions using `smoothFixedStartEndBW` before calling the libBigWig function `bwAddIntervalSpanSteps` to write to a bigWig file. I had hoped to replicate that here. However, I may need to create a new struct that implements the proper traits/Values so that the `BigWigWrite` functions can be used.

nleroy917 commented 4 months ago

> However, when attempting to write to a bigWig file after using the built in BedParser, I get a type mismatch (Value vs BedEntry)

I need to really look into it, but this kind of sounds like an error in their library? Or should we implement the Write trait for the BedEntry structs? I'm probably not understanding fully, though.

donaldcampbelljr commented 4 months ago

See the return types for these two functions: https://github.com/jackh726/bigtools/blob/ccc884904f2c210e143f118dc2cf268e94723f13/bigtools/src/bed/bedparser.rs#L21-L80

Then, when you go to use the write function, it requires ChromData<Value=Values> https://github.com/jackh726/bigtools/blob/ccc884904f2c210e143f118dc2cf268e94723f13/bigtools/src/bbi/bigwigwrite.rs#L144-L147

donaldcampbelljr commented 4 months ago

Per discussion, we should rethink writing to a bigWig file, as we do not need to use these files in the genome browser. Instead, this implementation should focus on taking either a combined BED file or a directory of BED files and creating something similar to a wiggle file, i.e. we do not need to capture the libBigWig functionality or implement items from bigtools. We should investigate using our own gtok file format or potentially a zarr format.

For inspiration of basic algorithm in Rust: https://github.com/databio/rustwig/blob/master/src/exact.rs
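To illustrate the "combined BED file or directory of BED files" input described above, here is a minimal sketch using only the standard library. The function name `collect_bed_files` and the choice to match on a `.bed` extension are assumptions made for illustration; they are not taken from uniwig, rustwig, or gtars, and compressed inputs such as `.bed.gz` are not handled.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: resolve an input path to a list of BED files.
/// A directory yields every regular file with a `.bed` extension (sorted);
/// any other path is treated as a single combined BED file.
pub fn collect_bed_files(input: &Path) -> io::Result<Vec<PathBuf>> {
    if !input.is_dir() {
        return Ok(vec![input.to_path_buf()]);
    }

    let mut files = Vec::new();
    for entry in fs::read_dir(input)? {
        let path = entry?.path();
        // Assumption: BED files are identified by extension only.
        let is_bed = path
            .extension()
            .and_then(|ext| ext.to_str())
            .map_or(false, |ext| ext.eq_ignore_ascii_case("bed"));
        if is_bed && path.is_file() {
            files.push(path);
        }
    }
    // Sort so that processing order is deterministic across platforms.
    files.sort();
    Ok(files)
}
```

Sorting the paths is a design choice here: `read_dir` makes no ordering guarantee, so a fixed order keeps downstream output reproducible.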

donaldcampbelljr commented 3 months ago

I've ported the core functionality from the above rustwig repository. `genimtools::uniwig` can now count starts and/or ends when given a single, sorted BED file.
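For readers unfamiliar with the counting step, the sketch below shows one way to tally region starts and ends per chromosome from BED text. It is not the ported genimtools code: `Counts`, `parse_coord`, and `count_starts_ends` are hypothetical names, and because it accumulates into a `BTreeMap` it does not rely on the input being sorted, which may differ from the actual implementation.

```rust
use std::collections::BTreeMap;

/// Tallies for one chromosome: position -> number of regions that
/// start (or end) at that position.
#[derive(Debug, Default, PartialEq)]
pub struct Counts {
    pub starts: BTreeMap<u32, u32>,
    pub ends: BTreeMap<u32, u32>,
}

/// Parse one coordinate field, reporting the line number on failure.
fn parse_coord(field: Option<&str>, name: &str, line_no: usize) -> Result<u32, String> {
    field
        .ok_or_else(|| format!("line {line_no}: missing {name}"))?
        .parse::<u32>()
        .map_err(|e| format!("line {line_no}: invalid {name}: {e}"))
}

/// Hypothetical helper: count starts and ends per chromosome.
/// Only the first three BED columns (chrom, start, end) are read.
pub fn count_starts_ends(bed: &str) -> Result<BTreeMap<String, Counts>, String> {
    let mut by_chrom: BTreeMap<String, Counts> = BTreeMap::new();

    for (idx, raw) in bed.lines().enumerate() {
        let line_no = idx + 1;
        let line = raw.trim();
        // Skip blank lines and comment lines.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }

        let mut fields = line.split_whitespace();
        let chrom = fields
            .next()
            .ok_or_else(|| format!("line {line_no}: missing chrom"))?;
        let start = parse_coord(fields.next(), "start", line_no)?;
        let end = parse_coord(fields.next(), "end", line_no)?;
        if end < start {
            return Err(format!("line {line_no}: end {end} precedes start {start}"));
        }

        let counts = by_chrom.entry(chrom.to_string()).or_default();
        *counts.starts.entry(start).or_insert(0) += 1;
        *counts.ends.entry(end).or_insert(0) += 1;
    }

    Ok(by_chrom)
}
```

Calling `count_starts_ends` on the contents of a BED file returns one `Counts` per chromosome; a sparse map is used here instead of a dense per-base array purely to keep the sketch small.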

We should decide on the output file format. I believe this is higher priority than covering the other gaps (sorted vs. unsorted input, reading a list of BED files instead of a single file, etc.).

We've discussed implementing zarr, though I haven't yet looked at the various Rust implementations to check their maturity. https://zarr.dev/implementations/

This project had a release as recently as March 2024: https://github.com/LDeakin/zarrs However, its GitHub page warns that the repository is not production-ready.

A simple, short-term option could be to make the output BED-like, similar to a bedGraph file, e.g.

file1.unibed

chromA  chromStartA  countValue
chromA  chromStartB  countValue

Open to other suggestions, especially if there is already an existing file format that makes more sense.
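As a rough illustration of the BED-like layout proposed above, the following sketch writes tab-separated `chrom`, `start`, `count` rows to any writer. The function name `write_unibed` and the `(position, count)` tuple input are assumptions made for this example; the `.unibed` format itself is only a proposal in this thread.

```rust
use std::io::{self, Write};

/// Hypothetical helper: write one `chrom<TAB>start<TAB>count` row per
/// (position, count) pair, mirroring the proposed `.unibed` layout.
pub fn write_unibed<W: Write>(
    out: &mut W,
    chrom: &str,
    counts: &[(u32, u32)],
) -> io::Result<()> {
    for (pos, count) in counts {
        writeln!(out, "{chrom}\t{pos}\t{count}")?;
    }
    Ok(())
}
```

Taking a generic `Write` means the same function can target a `File`, a `BufWriter`, or an in-memory `Vec<u8>` for testing.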

nleroy917 commented 3 months ago

This is great! Are you ready for a review of the code?

donaldcampbelljr commented 3 months ago

Not quite yet. Earlier today, we discussed just converting these to wig files in the short term, so I'll first look into writing these arrays to a file. After that, as a first pass, this should be ready to merge into dev.
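A minimal sketch of what writing a count array as a wig file could look like, assuming the `fixedStep` flavor of the wiggle format (a declaration line followed by one value per line, with a 1-based `start`). The function name `write_fixed_step_wig` and its parameters are hypothetical; the thread does not specify which wiggle variant will be used.

```rust
use std::io::{self, Write};

/// Hypothetical helper: write a dense array of counts as one fixedStep
/// wiggle block. `start` is 1-based, per the wiggle convention, so a
/// 0-based BED start must be incremented by one before calling this.
pub fn write_fixed_step_wig<W: Write>(
    out: &mut W,
    chrom: &str,
    start: u32,
    step: u32,
    values: &[u32],
) -> io::Result<()> {
    // Declaration line for the block.
    writeln!(out, "fixedStep chrom={chrom} start={start} step={step}")?;
    // One data value per line, each `step` bases after the previous one.
    for value in values {
        writeln!(out, "{value}")?;
    }
    Ok(())
}
```

A dense fixedStep block suits per-base count arrays; sparse counts would fit the `variableStep` flavor or the BED-like layout better.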

nleroy917 commented 3 months ago

Gotcha 👍🏼 Just lmk