fgvieira / prune_graph

Fast pruning of graphs based on node weights

Core dumped #2

Open Daeap opened 1 year ago

Daeap commented 1 year ago

Hi,

thank you for this significantly faster pruning option.

Unfortunately I have problems on ~larger files (or too many lines?)~ some but not all files, and they usually end up with a dumped core. This is a sample command:

~/scripts/prune_graph/target/release/prune_graph --in snps.ld.chr1.trunc --weight-field column_7 --weight-filter "column_3 <= 50000 && column_7 >= 0.1" --out chr1.unlinked.pos

This is the error:

thread 'main' has overflowed its stack
fatal runtime error: stack overflow

I use ngsLD files as input, either unedited or only the relevant columns:

chr1:29495   chr1:29651   156    0.093005   -0.020154   1.000000   0.027559
chr1:30862   chr1:32349   1487   0.065533   -0.012921   0.999999   0.016464
chr1:25380   chr1:27944   2564   0.633290   0.075310    0.999990   0.724847
chr1:32386   chr1:32677   291    0.996459   0.100407    1.000000   1.000000
chr1:30862   chr1:32386   1524   0.891505   0.092751    1.000000   0.914650
chr1:25380   chr1:29403   4023   0.123450   0.063818    0.999946   0.167704

System has 1TB+ RAM, 64 cores. Dependencies were installed with conda.

I have no real knowledge of Rust so there's not much I can add here or try to fix myself.

Best regards

David

fgvieira commented 1 year ago

Do you have a small example file so I can replicate the error?

Daeap commented 1 year ago

How can I send you a file? While it is not necessarily file-size dependent, I have truncated one of my files to the first 150 million lines and the error seems to happen thereabouts: the first 100 million lines run fine, but then it hits the same error. The file is still 1.6 GB after gzipping. Could you give me an e-mail address so I can share it via WeTransfer, if you're comfortable with that?

fgvieira commented 1 year ago

Some services let you just create a link (e.g. https://www.transfernow.net/en).

Daeap commented 1 year ago

Here is the link: https://www.transfernow.net/dl/20230627UMkOGefe

It's an ngsLD output file with only the first 3 columns and the 7th column, and only the first 150M lines.

jcaccavo commented 1 year ago

Hi there,

I'm experiencing the same issue. I also used the ngsLD output file as input to prune_graph, unedited.

This is the command I used:

/srv/public/users/jcaccavo/11_CCGA_full_seq/02_NovaSeq/02_WG/x_scripts/prune_graph/target/release/prune_graph --in dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_LD --weight-field "column_3" --weight-filter "column_3 > 0.5" --out dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5.id

This is the error I get:

thread 'main' has overflowed its stack
fatal runtime error: stack overflow
24_identify_unlinked_SNPs.sh: line 15:  2046 Aborted                 /srv/public/users/jcaccavo/11_CCGA_full_seq/02_NovaSeq/02_WG/x_scripts/prune_graph/target/release/prune_graph --in dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_LD --weight-field "column_3" --weight-filter "column_3 > 0.5" --out dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5.id

A zipped version of the ngsLD output file that I am using as input can be downloaded from my Dropbox. The file is 12.72 GB.

The head of the file looks as follows:

HiC_scaffold_10:79542   HiC_scaffold_10:79543   1   0.793716    0.071503    1.000000    0.999345
HiC_scaffold_10:78110   HiC_scaffold_10:78111   1   0.996641    0.044456    0.999972    0.999935
HiC_scaffold_10:79542   HiC_scaffold_10:79544   2   0.999996    0.056655    0.999909    0.999817
HiC_scaffold_10:78112   HiC_scaffold_10:78113   1   0.924173    0.043201    0.999983    0.999952
HiC_scaffold_10:79542   HiC_scaffold_10:79545   3   0.936968    0.056903    0.999909    0.999794
HiC_scaffold_10:78113   HiC_scaffold_10:78119   6   0.877176    0.043487    0.999982    0.999943
HiC_scaffold_10:79542   HiC_scaffold_10:79548   6   0.988844    0.056420    0.999907    0.999814
HiC_scaffold_10:78108   HiC_scaffold_10:78109   1   0.913702    0.044911    0.999965    0.999909
HiC_scaffold_10:78111   HiC_scaffold_10:78112   1   0.999846    0.044405    0.999973    0.999945
HiC_scaffold_10:78102   HiC_scaffold_10:78108   6   0.877358    0.044839    0.999957    0.999900

I am running on an HPC, and I also installed dependencies using conda.

Like David, I am not familiar with Rust, so it is not clear to me how to resolve this issue.

Thanks in advance for your help, Jilda

fgvieira commented 1 year ago

I'm trying to look into this issue, but I have been a bit busy. It seems to be a memory issue when running prune_graph with very big files, so it is not easy to debug.

jcaccavo commented 1 year ago

I totally understand - thanks for your efforts, and I'll stay tuned in case a solution does arise.

fgvieira commented 1 year ago

It seems it was because of petgraph/petgraph#399. I've changed it to use kosaraju_scc; can you give it a try with the latest commit?
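For reference, this is roughly what the change looks like with petgraph (just an illustrative sketch, not the actual prune_graph code; the toy graph, node names and weights are made up, and it assumes petgraph as a dependency, e.g. petgraph = "0.6" in Cargo.toml, version only as an example):

use petgraph::algo::kosaraju_scc;
use petgraph::graph::DiGraph;

fn main() {
    // Toy graph standing in for a few SNP pairs (made-up positions and LD weights)
    let mut g = DiGraph::<&str, f64>::new();
    let a = g.add_node("chr1:100");
    let b = g.add_node("chr1:200");
    let c = g.add_node("chr1:300");
    g.add_edge(a, b, 0.9);
    g.add_edge(b, c, 0.7);
    g.add_edge(c, a, 0.6);

    // kosaraju_scc is iterative, so it should not overflow the stack on very
    // large components the way the previously used SCC call could (petgraph#399)
    let components = kosaraju_scc(&g);
    println!("{} strongly connected component(s)", components.len());
}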

jcaccavo commented 1 year ago

Hi there,

Thanks so much for pursuing this issue, and apologies for my delayed feedback.

I tried running prune_graph on the smallest of my 4 datasets (52 GB).

prune_graph --in dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_LD.gz --weight-field "column_3" --weight-filter "column_3 > 0.5" --out dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5_pg.id RUST_BACKTRACE=1

This worked, but it took a whopping 10 days to run on our server! Using the perl script (prune_graph.pl) to identify unlinked SNPs from the same LD file, it took <24 hours. It seems strange that prune_graph, which should be more efficient than the perl script, takes 10x as long.

I think there may be an issue in either the code or the parameters I used, because after trimming the linked SNPs from my original beagle file using the unlinked SNP ID file generated by prune_graph, I get far fewer SNPs remaining in my dataset (5090) than when I used the perl script (219,149). Using the perl script, I entered the following code and parameters:

perl /srv/public/users/jcaccavo/11_CCGA_full_seq/02_NovaSeq/02_WG/x_scripts/ngsLD/scripts/prune_graph.pl --in_file dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_LD --max_kb_dist 20 --min_weight 0.5 --out dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5.id

I think an issue might be that with the perl script, I can modify the max_kb_dist, whereas with prune_graph, I don't see a way to do that.

In short: 1) prune_graph is now working for me, but 2) the processing time is 10x longer than with the perl script, and 3) I'm ending up with far more SNPs identified as linked, which could be related to not being able to input the max_kb_dist.

So my questions are: 1) Is it normal that prune_graph should take so much longer to run than the perl script? and 2) Is there a way to input the max_kb_dist to consider SNPs linked when running prune_graph (as is possible with the perl script)?

Thanks in advance for your feedback!

Best, Jilda

fgvieira commented 1 year ago

prune_graph is probably taking so long because you are not filtering by maximum SNP distance, while in the perl script you are using a maximum distance of 20 kb.

1. Is it normal that prune_graph should take so much longer to run than the perl script? and

No.

2. Is there a way to input the max_kb_dist to consider SNPs linked when running prune_graph (as is possible with the perl script)?

Yes, but it depends on your input file. If the file is the same as above, then the prune_graph command should be something like:

prune_graph --in dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_LD.gz --weight-field "column_7" --weight-filter "column_3 < 20000 && column_7 > 0.5" --out dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5_pg.id

This should only keep pairs of sites with a distance lower than 20kb (column_3 < 20000) and an LD higher than 0.5 (column_7 > 0.5), and use column_7 as the weight (not column_3 which is the distance between sites).

In the future, you can run ngsLD with an output header (--outH) so the output is easier to read.

EDIT: the output from the perl and rust implementations will not give exactly the same results, but the number of SNPs after pruning should be similar. Also, double check the output to make sure that only sites following those conditions were included.
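As a quick sanity check (just a sketch, assuming a header-less ngsLD file where column 3 is the distance and column 7 is r^2; input.ld is a placeholder name), you can count how many pairs actually pass the filter and compare it against the total number of pairs:

awk '$3 < 20000 && $7 > 0.5' input.ld | wc -l
wc -l < input.ld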

jcaccavo commented 1 year ago

Thank you so much for this feedback! I used your suggestions to run the following command, which took 15 hours:

prune_graph --in dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_LD --weight-field "column_7" --weight-filter "column_3 < 20000 && column_7 > 0.5" --out dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5_pg2.id

After trimming my original beagle.gz file based on the linked SNPs identified in this prune_graph run, the resultant beagle.gz file had 808,371 SNPs remaining. This is in contrast to 219,149 SNPs remaining when using the same parameters and the perl script with the following code:

perl /srv/public/users/jcaccavo/11_CCGA_full_seq/02_NovaSeq/02_WG/x_scripts/ngsLD/scripts/prune_graph.pl --in_file dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_LD --max_kb_dist 20 --min_weight 0.5 --out dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5.id

I saw in the edit to your response that the outputs from prune_graph (rust) and perl will not be exactly the same. Is it normal, though, for the numbers of linked SNPs identified to differ by this wide a margin? For perl, 1,720,847 linked SNPs were identified, and for prune_graph (rust) it was 1,311,625, a difference of 589,222 SNPs between the two methods.

It's true that the order of magnitude is the same (1 million/1 million), but I just wanted to be sure that this discrepancy of 589,222 SNPs is within the expected range, and not indicative of something wrong.

Thanks so much for your help!

fgvieira commented 1 year ago

Glad to hear things are starting to make sense! 😄

Just for the sake of completeness:

The two outputs are not expected to be exactly the same due to precision issues, but the difference should not be that large. How many SNPs are shared between the 808,371 (rust) and 219,149 (perl) remaining SNPs?

EDIT: have you also tried the python script? Do you get the same?

jcaccavo commented 11 months ago

Thanks as always for your patience in working with me to get prune_graph to work with my data.

I just re-ran prune_graph on the same dataset with the min_weight parameter modified from column_7 > 0.5 to column_7 >= 0.5. It only took about 2 hours to run!

.../prune_graph --in dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_LD --weight-field "column_7" --weight-filter "column_3 < 20000 && column_7 >= 0.5" --out dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5_pg3.id

The resultant number of unlinked SNPs is now 488,972, down from 808,371 with column_7 > 0.5. This is still over 2x more SNPs remaining than the perl script run, which resulted in 219,149 SNPs, i.e. half as many linked SNPs are being identified with prune_graph as with the perl script, with the same parameters. Do you think the difference in the handling of weights could be causing this discrepancy?

How would you recommend checking for common SNPs between two beagle.gz files? I did some searching on how to do this but didn't find any great answers. I tried zgrep -Fxf FILE1.beagle.gz FILE2.beagle.gz, but it was taking over an hour so I killed the process. Then I tried zgrep -c FILE1.beagle.gz FILE2.beagle.gz and got 0, which doesn't make sense.

Unfortunately I never managed to run the python script. There was a required module (graph-tool) that I tried installing with conda, but the environment could never be solved during installation, so I gave up and went back to trying to use prune_graph.

fgvieira commented 11 months ago

I'd say that double the number is a bit too much...

Just compare the output files from both methods (no need to compare the beagles). You can use grep or just:

cat prune_graph.rust.out prune_graph.perl.out  | sort | uniq -c
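Or, to directly count the IDs shared by the two outputs (just an alternative sketch; same placeholder file names):

comm -12 <(sort prune_graph.rust.out) <(sort prune_graph.perl.out) | wc -l
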
jcaccavo commented 11 months ago

Ah of course, that makes sense. I used:

grep -Fxf dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5.id dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5_pg3.id

219,136 SNPs were identified as shared, which is nearly every single unlinked SNP identified with the perl script (219,149). It seems then that prune_graph is identifying an additional 250k unlinked SNPs on top of those.

I'm attaching the two output files: dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5.id.txt generated with the perl script, and dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5_pg3.id.txt generated with prune_graph, as well as the shared SNPs file shared_SNPs_2x_perl_vs_rust_output3.txt, in case that might be helpful.

fgvieira commented 11 months ago

Did you do column_3 < 20000 or column_3 <= 20000? If the former, can you try the latter (just so it is as comparable as possible to the perl script)?

The prune_graph output file seems to be the old one, since it has 808265 lines.

Do you think you could generate a smaller dataset reproducing this issue? Debugging a 12.7 Gb file is kind of hard... 😄

jcaccavo commented 11 months ago

I had used column_3 < 20000. I have now re-run the same dataset with column_3 <= 20000. Actually, I went back and checked the final numbers of remaining unlinked SNPs and realized I had made an error: the "488,972" final unlinked SNP number was from another dataset processed with the perl script. For the dataset in question, the results of each analysis were the following:

Remaining unlinked SNPs:
perl script (min_weight 0.5, max_kb_dist 20k): 219,149
rust (< 20000, > 0.5): 808,371
rust (< 20000, >= 0.5): 808,266
rust (<= 20000, >= 0.5): 808,261

So indeed, the slight tweaks of the rust analysis parameters did not markedly change the final SNP numbers. They remain about 4x greater than the perl script output.

I ran a similar set of analyses on another (larger) dataset, and the results were as follows:

Remaining unlinked SNPs:
perl script (min_weight 0.5, max_kb_dist 20k): 488,972
rust (< 20000, >= 0.5): 2,196,255
rust (<= 20000, >= 0.5): 2,196,238

Again with this dataset, the perl script identified 4x more SNPs as linked than rust. Indeed, with the perl script, 11% and 9% of SNPs are retained as unlinked for both datasets, and with rust, 42% and 41% of SNPs are retained as unlinked. This at least seems somewhat consistent to me.

I am happy to move forward using rust for all of my linkage analyses. Indeed, it runs much more quickly, gives me a lot more SNPs to work with in the end, and works for all 4 of my datasets (whereas previously, my largest dataset could not be processed with the perl script). I guess I would just feel more confident if I understood why the perl script and rust were behaving so differently with my data.

I would be happy to provide a smaller dataset to reproduce this issue; I'm just not sure how to go about doing this. I produced the LD input files for the perl script and rust using my beagle files produced in ANGSD. I believe that, e.g. with samtools, it should be possible to randomly subsample a bam file to make a 'smaller, random sample' of the data. Would one method be to randomly sub-sample all of the bam files I used to produce the beagle file for a given dataset in ANGSD, so that the smaller beagle file would produce a smaller LD input file for perl and rust, etc.? Or is there perhaps a faster/quicker way?
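For the samtools idea, I was thinking of something like this (just a sketch; the file name and the 10% fraction are made up):

samtools view -b -s 42.1 input.bam > input.sub10pct.bam

where -s 42.1 keeps a random ~10% of reads using seed 42, applied to each bam before re-running ANGSD.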

Thanks so much for all of your help.

fgvieira commented 11 months ago

Thanks for looking so thoroughly into this. :smile:

It would be normal for the numbers to differ a bit (due to precision issues), but the difference should not be that big.

Can you send me a smaller file where I can reproduce the same pattern? A smaller dataset would be ideal but, if not possible, just get the first (e.g.) 10000 lines from the beagle file and check if you see the same pattern.

jcaccavo commented 11 months ago

Thanks for the advice on producing the smaller file. Of my 4 datasets, I took the smallest one and extracted the first 10000 lines from the beagle file as follows:

zcat dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X.beagle.gz | head -n 10000 > dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_first10000.beagle.gz

You can download that 'mini' beagle file from my Dropbox here (the file is 11 MB).

Let me know if that works, or if there's anything else I can provide that would help you reproduce this pattern. Thanks!

fgvieira commented 11 months ago

I tried running your file as:

ngsLD --geno dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_first10000.beagle.gz --pos dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_first10000.pos --n_ind 39 --n_sites 9999 --extend_out --outH out.ld

prune_graph --in out.ld --header --weight-field r2 --weight-filter "dist < 20000 && r2 >= 0.5" --out out.rust

tail -n +2 out.ld > out.ldH
prune_ngsLD.py --input out.ldH --max_dist 20000 --min_weight 0.5 --out out.python

perl prune_graph.pl --in_file out.ldH --max_kb_dist 20 --min_weight 0.5 | sort > out.perl

And got exactly the same result for all files:

$ md5sum out.*
b5493199f5729aee71884448e3891b6e  out.perl
b5493199f5729aee71884448e3891b6e  out.python
b5493199f5729aee71884448e3891b6e  out.rust

Do you have a file where I can reproduce what you are seeing?