databio / gtars

Performance-critical tools to manipulate, analyze, and process genomic interval data. Primarily focused on building tools for geniml - our genomic machine learning python package.
3 stars 2 forks source link

Dev uniwig #11

Closed donaldcampbelljr closed 4 months ago

donaldcampbelljr commented 8 months ago

Attempt adding uniwig to genimtools.

donaldcampbelljr commented 8 months ago

This addresses #1

donaldcampbelljr commented 7 months ago

Initial testing shows that the original uniwig .bw are much, much bigger than the .wig to .bw files that this rust-based version is creating (and the process is taking much longer), so I will need to investigate the discrepancies.

I already see some room for improvement where the signal tracks are not properly subtracted on the tail end,e.g. image

donaldcampbelljr commented 7 months ago

Ok, this is now much better: image

Still seeing a shift between using the two methods.

Currently, this Rust implementation creates 3 wiggle files which I then send to wigToBigWig.

Here's an example of the three wiggles given the input bed file (renamed with extensions .txt so github accepts them) _core.txt _end.txt _start.txt test5bed.txt

Perhaps, my wiggle header format is not proper and is why the shift is wrong?

Zooming in to show the start positions in IGV: image

nleroy917 commented 7 months ago

It also seems like the end.wig is getting truncated somehow?

donaldcampbelljr commented 7 months ago

I applied some smoothing at the closing out of the counting, and it helps: image

Both samples are smoothed with an int value of 5 and reported at the 1 bp level. There is still a bit of a shift between the original method and the new that should be understood.

nleroy917 commented 7 months ago

It appears that the rust version is shifted left for the core, but then shifted right for the start and end?

donaldcampbelljr commented 7 months ago

Looking better.... image

nleroy917 commented 7 months ago

Looking better.... image

Can you make the y axis scales the same? It seems like they might be different. that could help with visualization

donaldcampbelljr commented 7 months ago

image

I've tweaked the Rust version so its not quite 1:1 of the original C algorithm but has allowed the Rust output to be closer to the final output from C Uniwig. However, there is a critical step in the C code that hands off the values to the external library: image

This could also account for some discrepancy between the outputs as well.

donaldcampbelljr commented 7 months ago

Oh, here are some max values: image

nleroy917 commented 7 months ago

Yeah that's pretty dang close 😅

nleroy917 commented 7 months ago

Notes from 04/23/2024 infrastructure meeting:

donaldcampbelljr commented 7 months ago

Ok, I played around with this a bit more and adjusted the rust version again. I noticed that the original C version is actually counting through the first coordinate which accounts for the mismatch in the value in total account. So now the outputs match very well.

image

However, you'll notice that there still seems to be a one basepair shift in the final bigwig file. So, I spent some time investigating this.

Debug Output

Bed file snippet

chr1    248956314   248956322
chr1    248956316   248956323
chr1    248956318   248956324

Rust

BEGIN smooth_Fixed_Start_End_Wiggle
DEBUG: START SITE BEFORE ADJUSTMENT -> 248956314
DEBUG: START SITE AFTER ADJUSTMENT -> 248956309
DEBUG: INITIAL ENDSITE -> 248956320
DEBUG: SKIPPING UNTIL COORDINATE_POSITION < ADJUSTEDSTARTSITE -> 1  248956309
DEBUG: SKIPPING UNTIL COORDINATE_POSITION < ADJUSTEDSTARTSITE -> 248956309  248956309
DEBUG: BEGIN COORDINATE ITERATION
DEBUG: Coordinate Value: 248956316, Adjusted Start Site: 248956311, New Endsite: 248956322 
DEBUG: Reporting count: 1 at position: 248956309 for adjusted start site: 248956311
DEBUG: Incrementing coordinate_position: 248956309  -> 248956310
DEBUG: Reporting count: 1 at position: 248956310 for adjusted start site: 248956311
DEBUG: Incrementing coordinate_position: 248956310  -> 248956311
DEBUG: Reporting count: 1 at position: 248956311 for adjusted start site: 248956311
DEBUG: Incrementing coordinate_position: 248956311  -> 248956312
DEBUG: BEGIN COORDINATE ITERATION

C

        chr1    - uniwig with size 248956422    - start
C DEBUG: Skipping until count_index < cutsize  1  248956309
CAUGHT UP TO FIRST CUTSITE
C DEBUG: Skipping until count_index < cutsize  248956309  248956309
C DEBUG BEGIN COORDINATE (ITERATOR) ITERATION
C DEBUG CUTSITE (COORDINATE VALUE) BEFORE SMOOTHING   248956316
C DEBUG CUTSITE AFTER SMOOTHING   248956311
C DEBUG New Endsite   248956322
C DEBUG Reporting count: 1 at position248956309 for cutsite  248956311
248956309 - 248956309
1
bw library entries ->  Start: 248956309 valp: 1  n:  1
Incrementing COUNTINDEX  248956309 to  248956310
C DEBUG Reporting count: 1 at position248956310 for cutsite  248956311
248956310 - 248956310
1
bw library entries ->  Start: 248956310 valp: 1  n:  1
Incrementing COUNTINDEX  248956310 to  248956311
Incrementing Iterator

The TLDR is that we are reporting beginning at coordinate 248956309 for both codes. I also checked to ensure that the values being based to the C external library wasn't being adjusted in transit. They are not: bw library entries -> Start: 248956309 valp: 1 n: 1

Is it the wiggle file?

fixedStep chrom=chr1 start=248956309 step=1
1
1
2
2

From official reading, no. Our wiggle file is reporting a count starting at 309 not 308 (which is what IGV is showing). image

Reality check with pyBigWig

Let's do a reality check using pyBigWig to read these .bw files produced by og and new methods. We can read the interval beginning right before the counting was to begin (248956307 instead of 248956309). We can print out the values and the mean value...

import pyBigWig

bw = pyBigWig.open("test_start.bw")

print(bw.chroms())
print(bw.header())
print(bw.stats("chr1"))
print(bw.values("chr1", 248956307, 248956422))

bw_rust = pyBigWig.open("start_rust_bed4_003.bw")
print(bw_rust.chroms())
print(bw_rust.header())
print(bw_rust.stats("chr1"))
print(bw_rust.values("chr1", 248956307, 248956422))

Output:

c

{'chr1': 248956422}
{'version': 4, 'nLevels': 1, 'nBasesCovered': 114, 'minVal': 0, 'maxVal': 8, 'sumData': 121, 'sumSquared': 665}
[1.0707964601769913]
[nan, nan, 1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 5.0, 6.0, 7.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 5.0, 4.0, 3.0, 3.0, 3.0, 2.0, 2.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

rust

{'chr1': 248956422}
{'version': 4, 'nLevels': 2, 'nBasesCovered': 28, 'minVal': 1, 'maxVal': 8, 'sumData': 121, 'sumSquared': 665}
[4.321428571428571]
[nan, 1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 5.0, 6.0, 7.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 5.0, 4.0, 3.0, 3.0, 3.0, 2.0, 2.0, 1.0, 1.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]

Aha, so in the translation from wiggle to bw via wigToBigWig a shift is happening!

Other discrepancies that are important:

nleroy917 commented 7 months ago

Just read through all this, and this is great. If I am understanding correctly, it seems the issue is that wigToBigWig shifts the outputted wiggle file from the original C version over twice (or inserts two nans), but wigToBigWig shifts the outputted wiggle file from the Rust version over once (or inserted one nan).

Two questions:

  1. We still don't know why this is occurring?
  2. How much does this matter?
donaldcampbelljr commented 7 months ago

Just to clarify, it is 1 shift with a NaN vs no shift. The counting starts at *309 but I looked at the interval starting at *307 just to see where the values begin (because IGV doesn't show them in the browser?). So it appears as though wigToBigWig is shifting the output to the left (lower coordinates) by 1 for some reason. I'd like to confirm this with some more testing to make sure.

To answer your second question, I'm a bit worried about the NaN vs Zeros because pyBigWig calculates different numbers for statistics (specfically the mean in the above example). However, I looked at how we are currently using uniwig's bw files in geniml:

https://github.com/databio/geniml/blob/4657e53d8de0720c7041aff5f6143a4694b19f40/geniml/universe/hmm_universe.py#L47-L58

def process_bigwig(file, seq, p, chrom, chrom_size, normalize=True, mode=None):
    """Preprocess bigWig file"""
    if pyBigWig.numpy:
        track = file.values(chrom, 0, chrom_size, numpy=True)
    else:
        track = file.values(chrom, 0, chrom_size)
        track = np.array(track)
    track[np.isnan(track)] = 0
    track = track.astype(np.uint16)
    if normalize:
        norm(track, mode)
    seq[:, p] = track

And it appears as though the first step after loading the bw using pyBigWig is to replace the NaNs with zeros... So I believe we are in the clear with the advantage being that the rust output is much smaller for counts occurring at the beginning of chromosomes (since the endings are not padded with zeros).

Next Steps

I will play around with this a bit more using wigToBigWig and see if I can replicate it for other wiggle files.

nleroy917 commented 7 months ago

I think it was decided to just keep using wiggle and big wig files for compatibility, however there is a rust crate that will let you dump an array to a .npy file, which can then be instantly read into python. So, it circumvents the entire process of bigWig conversion since the end goal is just getting the data out anyone, right?

Write the data in Rust:

use ndarray::array;
use ndarray_npy::write_npy;

let arr = array![[1, 2, 3], [4, 5, 6]];
write_npy("array.npy", &arr)?;

Read into python:

import numpy as np

track = np.load("array.npy")

Just an idea

donaldcampbelljr commented 7 months ago

Oh yeah. I think your suggestion is the way to go to simplify things. I'll implement that as a file export option.

nleroy917 commented 7 months ago

I guess you'd just need to export an array per chromosome, right? So 70 ish files... start, end, and, core for each chromosome?

nleroy917 commented 7 months ago

You could even maintain backwards compatibility and let the process_bigwig function accept either file types?

def process_bigwig(file, seq, p, chrom, chrom_size, normalize=True, mode=None):
    """Preprocess bigWig file"""
    ext = pathlib.Path(file).suffix:
    if ext == ".npy"
        # process as numpy array
    else:
        # assume bigwig and process that way.
nleroy917 commented 7 months ago

One downside to this is that the ndarray-npy crate doesn't seem like its very actively maintained

donaldcampbelljr commented 7 months ago

I believe this may be some sort an indexing issue here, 1 vs 0 based indexing for these tools.

The bedGraph format, like all BED-based formats and most file formats used by UCSC, use "0-start, half-open" coordinates, but the wiggle ASCII text format for variableStep and fixedStep data uses "1-start, fully-closed" coordinates. Wiggle (variableStep and fixedStep) is the only format defined by UCSC that uses a 1-based format, for historical reasons. For example, for a chromosome of length N, the first position is 1 and the last position is N. For more information,

Some related reading: https://www.biostars.org/p/384606/

I used one of the other utilities (bigWigToWig) to convert our bw to wiggles to see if the outputs are as expected.

uniwig C:

fixedStep chrom=chr1 start=248956310 step=1 span=1
1

uniwig Rust:

fixedStep chrom=chr1 start=248956309 step=1 span=1
1

Was expecting them uniwig to have a start position at 309 and rust uniwig to have a position of 308!


I agree that the above crate does not inspire confidence with the outdated dependencies, etc. Perhaps, there are other paths: https://github.com/pola-rs/pyo3-polars

nleroy917 commented 7 months ago

Yeah, polars is an option for sure. And, I like that idea a lot since it slots so nicely into our Rust/Python/pyo3 ecosystem. Two issues:

  1. polars cranks the compile time up tremendously
  2. A lot of the data frame libraries we want to be using over in Bioconductor are pandas based, not polar based.

Also, it doesn't let us save files to disk (right?) So we kinda lose the value of independent uniwig, I guess.

donaldcampbelljr commented 6 months ago

Per today's discussion:

donaldcampbelljr commented 4 months ago

This now supports creating .npy files for each chrom and associated meta files for starts, ends, and core: image

Each meta file holds the header information. Currently, it is the exact same as that which would be placed into the .wig file: image

Above examples created using this test bedfile:

chr11   10  50
chr11   20  76
chr12   769 2395
chr13   771 3000
chr14   800 2900
chr21   1   30
chr21   2   19
chr21   16  31
nleroy917 commented 4 months ago

Now would be a great time to try all this out, because @ClaudeHu needs to create a universe with uniwig, but perhaps he should try this implementation first. @donaldcampbelljr Is this ready to be tested?

nleroy917 commented 4 months ago

@donaldcampbelljr @ClaudeHu and I found a small "user experience" thing. Claude was trying to run this with an invalid combinedbed file and the error message that we were getting was: Error reading chromosome sizes: ...

However, it sent us on the wrong path because the program doesn't seem to use that anyways and just goes for the combinedbed argument.

We got it all sorted, but had to follow the breadcrumbs for a bit, so maybe we could update that error message if we choose to stick with keeping that method of identifying chrom sizes.

Also this seems useful: https://stackoverflow.com/questions/42275777/how-to-trace-the-cause-of-an-error-result

nleroy917 commented 4 months ago

@donaldcampbelljr one more thing...

@ClaudeHu found that when he provided an incorrect outputtype argument, it said it was defaulting to wiggle but instead did nothing. Looking at this match:

match output_type { } 

It seems like it just does nothing if a non-standard output type is provided, it just prints a misleading statement:

_ => {println!("Default to wig file.")},

Should it panic!?

nleroy917 commented 4 months ago

One more.....

In fn write_to_wig_file: should we create any necessary parent directories for the output file if they don't exist yet?


    let mut file = OpenOptions::new()
        .create(true)  // Create the file if it doesn't exist
        .append(true)  // Append data to the existing file if it does exist
        .open(filename).unwrap();
donaldcampbelljr commented 4 months ago

@donaldcampbelljr one more thing...

@ClaudeHu found that when he provided an incorrect outputtype argument, it said it was defaulting to wiggle but instead did nothing. Looking at this match:

match output_type { } 

It seems like it just does nothing if a non-standard output type is provided, it just prints a misleading statement:

_ => {println!("Default to wig file.")},

Should it panic!?

I've now added it such that it will default to writing npy files appropriately.

donaldcampbelljr commented 4 months ago

One more.....

In fn write_to_wig_file: should we create any necessary parent directories for the output file if they don't exist yet?

    let mut file = OpenOptions::new()
        .create(true)  // Create the file if it doesn't exist
        .append(true)  // Append data to the existing file if it does exist
        .open(filename).unwrap();

Yeah, I solved this in the IGD branch. I've added the functionality here as well. However, I'd like to move this into common functions once both branches have been merged down to dev.

donaldcampbelljr commented 4 months ago

Added better error propagation.

I also made the chromsize reference optional.

I found that the code was already ignoring the chrom.sizes file and simply using the combined bed file to determine the chromosome size from the last end position of the chromosome.

Now, the user can use either approach. Using the chromsize ref path for smaller bed files with regions far upstream to the chromosome end will be slower as uniwig will pad out the remaining values with 0s until it reaches the end of the chrom.

Currently, it is unclear if we actually need to know the end of the chromosome for uniwig's downstream uses.