Closed donaldcampbelljr closed 4 months ago
This addresses #1
Initial testing shows that the original uniwig .bw
are much, much bigger than the .wig
to .bw
files that this rust-based version is creating (and the process is taking much longer), so I will need to investigate the discrepancies.
I already see some room for improvement where the signal tracks are not properly subtracted on the tail end,e.g.
Ok, this is now much better:
Still seeing a shift between using the two methods.
Currently, this Rust implementation creates 3 wiggle files which I then send to wigToBigWig
.
Here's an example of the three wiggles given the input bed file (renamed with extensions .txt so github accepts them) _core.txt _end.txt _start.txt test5bed.txt
Perhaps, my wiggle header format is not proper and is why the shift is wrong?
Zooming in to show the start positions in IGV:
It also seems like the end.wig
is getting truncated somehow?
I applied some smoothing at the closing out of the counting, and it helps:
Both samples are smoothed with an int value of 5
and reported at the 1 bp
level. There is still a bit of a shift between the original method and the new that should be understood.
It appears that the rust version is shifted left for the core, but then shifted right for the start and end?
Looking better....
Looking better....
Can you make the y axis scales the same? It seems like they might be different. that could help with visualization
I've tweaked the Rust version so its not quite 1:1 of the original C algorithm but has allowed the Rust output to be closer to the final output from C Uniwig. However, there is a critical step in the C code that hands off the values to the external library:
This could also account for some discrepancy between the outputs as well.
Oh, here are some max values:
Yeah that's pretty dang close 😅
Notes from 04/23/2024 infrastructure meeting:
Ok, I played around with this a bit more and adjusted the rust version again. I noticed that the original C version is actually counting through the first coordinate which accounts for the mismatch in the value in total account. So now the outputs match very well.
However, you'll notice that there still seems to be a one basepair shift in the final bigwig file. So, I spent some time investigating this.
Bed file snippet
chr1 248956314 248956322
chr1 248956316 248956323
chr1 248956318 248956324
Rust
BEGIN smooth_Fixed_Start_End_Wiggle
DEBUG: START SITE BEFORE ADJUSTMENT -> 248956314
DEBUG: START SITE AFTER ADJUSTMENT -> 248956309
DEBUG: INITIAL ENDSITE -> 248956320
DEBUG: SKIPPING UNTIL COORDINATE_POSITION < ADJUSTEDSTARTSITE -> 1 248956309
DEBUG: SKIPPING UNTIL COORDINATE_POSITION < ADJUSTEDSTARTSITE -> 248956309 248956309
DEBUG: BEGIN COORDINATE ITERATION
DEBUG: Coordinate Value: 248956316, Adjusted Start Site: 248956311, New Endsite: 248956322
DEBUG: Reporting count: 1 at position: 248956309 for adjusted start site: 248956311
DEBUG: Incrementing coordinate_position: 248956309 -> 248956310
DEBUG: Reporting count: 1 at position: 248956310 for adjusted start site: 248956311
DEBUG: Incrementing coordinate_position: 248956310 -> 248956311
DEBUG: Reporting count: 1 at position: 248956311 for adjusted start site: 248956311
DEBUG: Incrementing coordinate_position: 248956311 -> 248956312
DEBUG: BEGIN COORDINATE ITERATION
C
chr1 - uniwig with size 248956422 - start
C DEBUG: Skipping until count_index < cutsize 1 248956309
CAUGHT UP TO FIRST CUTSITE
C DEBUG: Skipping until count_index < cutsize 248956309 248956309
C DEBUG BEGIN COORDINATE (ITERATOR) ITERATION
C DEBUG CUTSITE (COORDINATE VALUE) BEFORE SMOOTHING 248956316
C DEBUG CUTSITE AFTER SMOOTHING 248956311
C DEBUG New Endsite 248956322
C DEBUG Reporting count: 1 at position248956309 for cutsite 248956311
248956309 - 248956309
1
bw library entries -> Start: 248956309 valp: 1 n: 1
Incrementing COUNTINDEX 248956309 to 248956310
C DEBUG Reporting count: 1 at position248956310 for cutsite 248956311
248956310 - 248956310
1
bw library entries -> Start: 248956310 valp: 1 n: 1
Incrementing COUNTINDEX 248956310 to 248956311
Incrementing Iterator
The TLDR is that we are reporting beginning at coordinate 248956309
for both codes. I also checked to ensure that the values being based to the C external library wasn't being adjusted in transit. They are not: bw library entries -> Start: 248956309 valp: 1 n: 1
fixedStep chrom=chr1 start=248956309 step=1
1
1
2
2
From official reading, no. Our wiggle file is reporting a count starting at 309
not 308
(which is what IGV is showing).
Let's do a reality check using pyBigWig to read these .bw files produced by og and new methods. We can read the interval beginning right before the counting was to begin (248956307 instead of 248956309). We can print out the values and the mean value...
import pyBigWig
bw = pyBigWig.open("test_start.bw")
print(bw.chroms())
print(bw.header())
print(bw.stats("chr1"))
print(bw.values("chr1", 248956307, 248956422))
bw_rust = pyBigWig.open("start_rust_bed4_003.bw")
print(bw_rust.chroms())
print(bw_rust.header())
print(bw_rust.stats("chr1"))
print(bw_rust.values("chr1", 248956307, 248956422))
Output:
c
{'chr1': 248956422}
{'version': 4, 'nLevels': 1, 'nBasesCovered': 114, 'minVal': 0, 'maxVal': 8, 'sumData': 121, 'sumSquared': 665}
[1.0707964601769913]
[nan, nan, 1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 5.0, 6.0, 7.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 5.0, 4.0, 3.0, 3.0, 3.0, 2.0, 2.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
rust
{'chr1': 248956422}
{'version': 4, 'nLevels': 2, 'nBasesCovered': 28, 'minVal': 1, 'maxVal': 8, 'sumData': 121, 'sumSquared': 665}
[4.321428571428571]
[nan, 1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 5.0, 6.0, 7.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 5.0, 4.0, 3.0, 3.0, 3.0, 2.0, 2.0, 1.0, 1.0, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]
Aha, so in the translation from wiggle to bw via wigToBigWig
a shift is happening!
Other discrepancies that are important:
Just read through all this, and this is great. If I am understanding correctly, it seems the issue is that wigToBigWig shifts the outputted wiggle file from the original C version over twice (or inserts two nan
s), but wigToBigWig shifts the outputted wiggle file from the Rust version over once (or inserted one nan
).
Two questions:
Just to clarify, it is 1 shift with a NaN vs no shift. The counting starts at *309
but I looked at the interval starting at *307
just to see where the values begin (because IGV doesn't show them in the browser?). So it appears as though wigToBigWig
is shifting the output to the left (lower coordinates) by 1 for some reason. I'd like to confirm this with some more testing to make sure.
To answer your second question, I'm a bit worried about the NaN vs Zeros because pyBigWig
calculates different numbers for statistics (specfically the mean in the above example). However, I looked at how we are currently using uniwig's bw
files in geniml:
def process_bigwig(file, seq, p, chrom, chrom_size, normalize=True, mode=None):
"""Preprocess bigWig file"""
if pyBigWig.numpy:
track = file.values(chrom, 0, chrom_size, numpy=True)
else:
track = file.values(chrom, 0, chrom_size)
track = np.array(track)
track[np.isnan(track)] = 0
track = track.astype(np.uint16)
if normalize:
norm(track, mode)
seq[:, p] = track
And it appears as though the first step after loading the bw
using pyBigWig is to replace the NaNs with zeros... So I believe we are in the clear with the advantage being that the rust output is much smaller for counts occurring at the beginning of chromosomes (since the endings are not padded with zeros).
I will play around with this a bit more using wigToBigWig
and see if I can replicate it for other wiggle files.
I think it was decided to just keep using wiggle and big wig files for compatibility, however there is a rust crate that will let you dump an array to a .npy
file, which can then be instantly read into python. So, it circumvents the entire process of bigWig conversion since the end goal is just getting the data out anyone, right?
Write the data in Rust:
use ndarray::array;
use ndarray_npy::write_npy;
let arr = array![[1, 2, 3], [4, 5, 6]];
write_npy("array.npy", &arr)?;
Read into python:
import numpy as np
track = np.load("array.npy")
Just an idea
Oh yeah. I think your suggestion is the way to go to simplify things. I'll implement that as a file export option.
I guess you'd just need to export an array per chromosome, right? So 70 ish files... start, end, and, core for each chromosome?
You could even maintain backwards compatibility and let the process_bigwig
function accept either file types?
def process_bigwig(file, seq, p, chrom, chrom_size, normalize=True, mode=None):
"""Preprocess bigWig file"""
ext = pathlib.Path(file).suffix:
if ext == ".npy"
# process as numpy array
else:
# assume bigwig and process that way.
One downside to this is that the ndarray-npy
crate doesn't seem like its very actively maintained
I believe this may be some sort an indexing issue here, 1 vs 0 based indexing for these tools.
The bedGraph format, like all BED-based formats and most file formats used by UCSC, use "0-start, half-open" coordinates, but the wiggle ASCII text format for variableStep and fixedStep data uses "1-start, fully-closed" coordinates. Wiggle (variableStep and fixedStep) is the only format defined by UCSC that uses a 1-based format, for historical reasons. For example, for a chromosome of length N, the first position is 1 and the last position is N. For more information,
Some related reading: https://www.biostars.org/p/384606/
I used one of the other utilities (bigWigToWig
) to convert our bw
to wiggles
to see if the outputs are as expected.
uniwig C:
fixedStep chrom=chr1 start=248956310 step=1 span=1
1
uniwig Rust:
fixedStep chrom=chr1 start=248956309 step=1 span=1
1
Was expecting them uniwig to have a start position at 309 and rust uniwig to have a position of 308!
I agree that the above crate does not inspire confidence with the outdated dependencies, etc. Perhaps, there are other paths: https://github.com/pola-rs/pyo3-polars
Yeah, polars
is an option for sure. And, I like that idea a lot since it slots so nicely into our Rust/Python/pyo3
ecosystem. Two issues:
polars
cranks the compile time up tremendouslypandas
based, not polar
based.Also, it doesn't let us save files to disk (right?) So we kinda lose the value of independent uniwig
, I guess.
Per today's discussion:
This now supports creating .npy
files for each chrom and associated meta files for starts, ends, and core:
Each meta file holds the header information. Currently, it is the exact same as that which would be placed into the .wig
file:
Above examples created using this test bedfile:
chr11 10 50
chr11 20 76
chr12 769 2395
chr13 771 3000
chr14 800 2900
chr21 1 30
chr21 2 19
chr21 16 31
Now would be a great time to try all this out, because @ClaudeHu needs to create a universe with uniwig
, but perhaps he should try this implementation first. @donaldcampbelljr Is this ready to be tested?
@donaldcampbelljr @ClaudeHu and I found a small "user experience" thing. Claude was trying to run this with an invalid combinedbed
file and the error message that we were getting was: Error reading chromosome sizes: ...
However, it sent us on the wrong path because the program doesn't seem to use that anyways and just goes for the combinedbed
argument.
We got it all sorted, but had to follow the breadcrumbs for a bit, so maybe we could update that error message if we choose to stick with keeping that method of identifying chrom sizes.
Also this seems useful: https://stackoverflow.com/questions/42275777/how-to-trace-the-cause-of-an-error-result
@donaldcampbelljr one more thing...
@ClaudeHu found that when he provided an incorrect outputtype
argument, it said it was defaulting to wiggle
but instead did nothing. Looking at this match:
match output_type { }
It seems like it just does nothing if a non-standard output type is provided, it just prints a misleading statement:
_ => {println!("Default to wig file.")},
Should it panic!
?
One more.....
In fn write_to_wig_file
: should we create any necessary parent directories for the output file if they don't exist yet?
let mut file = OpenOptions::new()
.create(true) // Create the file if it doesn't exist
.append(true) // Append data to the existing file if it does exist
.open(filename).unwrap();
@donaldcampbelljr one more thing...
@ClaudeHu found that when he provided an incorrect
outputtype
argument, it said it was defaulting towiggle
but instead did nothing. Looking at this match:match output_type { }
It seems like it just does nothing if a non-standard output type is provided, it just prints a misleading statement:
_ => {println!("Default to wig file.")},
Should it
panic!
?
I've now added it such that it will default to writing npy
files appropriately.
One more.....
In
fn write_to_wig_file
: should we create any necessary parent directories for the output file if they don't exist yet?let mut file = OpenOptions::new() .create(true) // Create the file if it doesn't exist .append(true) // Append data to the existing file if it does exist .open(filename).unwrap();
Yeah, I solved this in the IGD branch. I've added the functionality here as well. However, I'd like to move this into common functions once both branches have been merged down to dev.
Added better error propagation.
I also made the chromsize reference optional.
I found that the code was already ignoring the chrom.sizes file and simply using the combined bed file to determine the chromosome size from the last end position of the chromosome.
Now, the user can use either approach. Using the chromsize ref path for smaller bed files with regions far upstream to the chromosome end will be slower as uniwig will pad out the remaining values with 0s until it reaches the end of the chrom.
Currently, it is unclear if we actually need to know the end of the chromosome for uniwig's downstream uses.
Attempt adding uniwig to genimtools.