Daeap opened this issue 1 year ago
Do you have a small example file so I can replicate the error?
How can I send you a file? While it is not necessarily file-size dependent, I have truncated one of my files to the first 150 million lines and the error seems to happen thereabouts: the first 100 million lines run fine, but then it fails with the same error. The file is still 1.6 GB after gzipping. Could you give me an email address so I could share it via WeTransfer, if you're comfortable with that?
Some services let you just create a link (e.g. https://www.transfernow.net/en)
Here is the link: https://www.transfernow.net/dl/20230627UMkOGefe
It's an ngsLD output file, with only the first 3 columns and the 7th column. And these are only the first 150M lines.
Hi there,
I'm experiencing the same issue. I also used the ngsLD output file as input to prune_graph, unedited.
This is the command I used:
/srv/public/users/jcaccavo/11_CCGA_full_seq/02_NovaSeq/02_WG/x_scripts/prune_graph/target/release/prune_graph --in dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_LD --weight-field "column_3" --weight-filter "column_3 > 0.5" --out dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5.id
This is the error I get:
thread 'main' has overflowed its stack
fatal runtime error: stack overflow
24_identify_unlinked_SNPs.sh: line 15: 2046 Aborted /srv/public/users/jcaccavo/11_CCGA_full_seq/02_NovaSeq/02_WG/x_scripts/prune_graph/target/release/prune_graph --in dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_LD --weight-field "column_3" --weight-filter "column_3 > 0.5" --out dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5.id
A zipped version of the ngsLD output file that I am using as input can be downloaded from my Dropbox. The file is 12.72 GB.
The head of the file looks as follows:
HiC_scaffold_10:79542 HiC_scaffold_10:79543 1 0.793716 0.071503 1.000000 0.999345
HiC_scaffold_10:78110 HiC_scaffold_10:78111 1 0.996641 0.044456 0.999972 0.999935
HiC_scaffold_10:79542 HiC_scaffold_10:79544 2 0.999996 0.056655 0.999909 0.999817
HiC_scaffold_10:78112 HiC_scaffold_10:78113 1 0.924173 0.043201 0.999983 0.999952
HiC_scaffold_10:79542 HiC_scaffold_10:79545 3 0.936968 0.056903 0.999909 0.999794
HiC_scaffold_10:78113 HiC_scaffold_10:78119 6 0.877176 0.043487 0.999982 0.999943
HiC_scaffold_10:79542 HiC_scaffold_10:79548 6 0.988844 0.056420 0.999907 0.999814
HiC_scaffold_10:78108 HiC_scaffold_10:78109 1 0.913702 0.044911 0.999965 0.999909
HiC_scaffold_10:78111 HiC_scaffold_10:78112 1 0.999846 0.044405 0.999973 0.999945
HiC_scaffold_10:78102 HiC_scaffold_10:78108 6 0.877358 0.044839 0.999957 0.999900
I am running on an HPC, and I also installed dependencies using conda.
Like David, I am not familiar with Rust, so it is not clear to me how to resolve this issue.
Thanks in advance for your help, Jilda
I'm trying to look into this issue, but I have been a bit busy. It seems to be a memory issue when running prune_graph with very big files, so it is not easy to debug.
I totally understand - thanks for your efforts, and I'll stay tuned in case a solution does arise.
It seems it was because of petgraph/petgraph#399. I've changed it to use kosaraju_scc; can you give it a try with the latest commit?
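For context (and as far as I understand the petgraph internals), kosaraju_scc traverses the graph with explicit stacks rather than deep recursion, which is what was overflowing on very large graphs. Below is a toy sketch, not the actual prune_graph code, of how the call looks; the three-node graph is made up purely for illustration:

use petgraph::algo::kosaraju_scc;
use petgraph::graph::Graph;

fn main() {
    // Toy directed graph; roughly speaking, prune_graph builds one node per
    // SNP and one edge per pair of sites that passes the weight filter.
    let mut g: Graph<&str, f64> = Graph::new();
    let a = g.add_node("chr1:100");
    let b = g.add_node("chr1:200");
    let c = g.add_node("chr1:300");
    g.add_edge(a, b, 0.9);
    g.add_edge(b, a, 0.9);
    g.add_edge(b, c, 0.7);

    // Each inner Vec is one strongly connected component; because the
    // traversal is iterative, huge components no longer blow the call stack.
    let components = kosaraju_scc(&g);
    println!("{} components", components.len());
}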
Hi there,
Thanks so much for pursuing this issue, and apologies for my delayed feedback.
I tried running prune_graph on the smallest of my 4 datasets (52 GB).
prune_graph --in dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_LD.gz --weight-field "column_3" --weight-filter "column_3 > 0.5" --out dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5_pg.id RUST_BACKTRACE=1
This worked, but it took a whopping 10 days to run on our server! Using the perl script (prune_graph.pl) to identify unlinked SNPs from the same LD file, it took <24 hours. It seems strange that prune_graph, which should be more efficient than the perl script, takes 10x as long.
I think there may be an issue in either the code or the parameters I used, because after trimming the linked SNPs from my original beagle file using the unlinked SNP ID file generated from prune_graph, I get far fewer SNPs remaining in my dataset (5090) than when I used the perl script (219,149). Using the perl script, I entered the following code and parameters:
perl /srv/public/users/jcaccavo/11_CCGA_full_seq/02_NovaSeq/02_WG/x_scripts/ngsLD/scripts/prune_graph.pl --in_file dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_LD --max_kb_dist 20 --min_weight 0.5 --out dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5.id
I think an issue might be that with the perl script, I can modify the max_kb_dist, whereas with prune_graph, I don't see a way to do that.
In short: 1) prune_graph is now working for me, but 2) the processing time is 10x longer than with the perl script (prune_graph.pl), and 3) I'm ending up with far more SNPs identified as linked, which could be related to not being able to input the max_kb_dist.
So my questions are: 1) Is it normal that prune_graph should take so much longer to run than the perl script? and 2) Is there a way to input the max_kb_dist to consider SNPs linked when running prune_graph (as is possible with the perl script)?
Thanks in advance for your feedback!
Best, Jilda
prune_graph is probably taking so long because you are not filtering by maximum SNP distance, while in the perl script you are using a maximum distance of 20 kb.
1. Is it normal that prune_graph should take so much longer to run than the perl script? and
No.
2. Is there a way to input the max_kb_dist to consider SNPs linked when running prune_graph (as is possible with the perl script)?
Yes, but it depends on your input file. If the file is the same as above, then the prune_graph command should be something like:
prune_graph --in dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_LD.gz --weight-field "column_7" --weight-filter "column_3 < 20000 && column_7 > 0.5" --out dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5_pg.id
This should only keep pairs of sites with a distance lower than 20 kb (column_3 < 20000) and an LD higher than 0.5 (column_7 > 0.5), and use column_7 as the weight (not column_3, which is the distance between sites).
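For illustration only, here is a minimal sketch of what that filter expresses when applied line by line (assuming the 7-column ngsLD output shown above; this is not prune_graph's actual parser):

use std::io::{self, BufRead};

fn main() {
    let stdin = io::stdin();
    for line in stdin.lock().lines() {
        let line = line.expect("read error");
        let f: Vec<&str> = line.split_whitespace().collect();
        if f.len() < 7 {
            continue; // skip header or malformed lines
        }
        // column_3 = distance between the two sites (bp), column_7 = r^2.
        let dist: u64 = f[2].parse().unwrap_or(u64::MAX);
        let r2: f64 = f[6].parse().unwrap_or(0.0);
        // Keep only pairs closer than 20 kb with LD above 0.5; these pairs
        // become the edges of the pruning graph, weighted by r^2.
        if dist < 20_000 && r2 > 0.5 {
            println!("{}", line);
        }
    }
}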
In the future, you can run ngsLD with an output header (--outH) so the output is easier to read.
EDIT: the perl and rust implementations will not give exactly the same results, but the number of SNPs after pruning should be similar. Also, double-check the output to make sure that only sites meeting those conditions were included.
Thank you so much for this feedback! I indeed used your suggestions to run the following code, which took 15 hours to run:
prune_graph --in dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_LD --weight-field "column_7" --weight-filter "column_3 < 20000 && column_7 > 0.5" --out dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5_pg2.id
After trimming my original beagle.gz file based on the linked SNPs identified in this prune_graph run, the resultant beagle.gz file had 808,371 SNPs remaining. This is in contrast to 219,149 SNPs remaining when using the same parameters and the perl script with the following code:
perl /srv/public/users/jcaccavo/11_CCGA_full_seq/02_NovaSeq/02_WG/x_scripts/ngsLD/scripts/prune_graph.pl --in_file dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_LD --max_kb_dist 20 --min_weight 0.5 --out dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5.id
I saw in the edit to your response that the outputs from prune_graph (rust) and perl will not be exactly the same. Is it normal, though, that the numbers of linked SNPs identified differ by this wide a margin? For perl, 1,720,847 linked SNPs were identified, and for prune_graph (rust) it was 1,311,625, which is a difference of 589,222 SNPs between the two methods.
It's true that the order of magnitude is the same (1 million/1 million), but I just wanted to be sure that this discrepancy of 589,222 SNPs is within the expected range, and not indicative of something wrong.
Thanks so much for your help!
Glad to hear things are starting to make sense! 😄
Just for the sake of completeness:
- the equivalent filter to your perl command should be --weight-field "column_7" --weight-filter "column_3 <= 20000 && column_7 >= 0.5". I don't think it will make a big difference, but let me know what numbers you get if you run it again.
- another difference between the two is that the perl script uses the absolute weight by default, while the rust implementation uses the weight as is (the user needs to convert the weights to whatever they see fit beforehand; see the sketch below); but again, this should not matter here since column_7 is 0 < r^2 < 1.
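In case someone does need that conversion (e.g. a weight column that can be negative, such as D or r), here is a minimal, hypothetical sketch of such a pre-processing step, assuming the 7-column layout above with the weight in column 7:

use std::io::{self, BufRead, Write};

fn main() {
    // Hypothetical pre-processing: replace column 7 with its absolute value
    // before passing the file to prune_graph (adjust the index if your
    // weight sits in a different column).
    let stdin = io::stdin();
    let stdout = io::stdout();
    let mut out = stdout.lock();
    for line in stdin.lock().lines() {
        let line = line.expect("read error");
        let mut f: Vec<String> = line.split_whitespace().map(String::from).collect();
        if f.len() >= 7 {
            if let Ok(w) = f[6].parse::<f64>() {
                f[6] = format!("{:.6}", w.abs());
            }
        }
        writeln!(out, "{}", f.join("\t")).expect("write error");
    }
}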
The two outputs are not expected to be exactly the same due to precision issues, but the difference should not be that large.
How many SNPs are common between the 808,371 (rust) and the 219,149 (perl) remaining SNPs?
EDIT: have you also tried the python script? Do you get the same?
Thanks as always for your patience in working with me to get prune_graph to work with my data.
I just re-ran prune_graph on the same dataset with the min_weight parameter modified from column_7 > 0.5 to column_7 >= 0.5. It only took about 2 hours to run!
.../prune_graph --in dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_LD --weight-field "column_7" --weight-filter "column_3 < 20000 && column_7 >= 0.5" --out dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5_pg3.id
The resultant number of unlinked SNPs is now 488,972, down from 808,371 with column_7 > 0.5. This is still over 2x more SNPs remaining than the perl script run, which resulted in 219,149 SNPs, i.e. half as many linked SNPs are being identified with prune_graph as with the perl script, under the same parameters. Do you think the difference in how weights are processed could be causing this discrepancy?
How would you recommend checking for common SNPs between two beagle.gz files? I did some searching on how to do this but didn't find any great answers. I tried zgrep -Fxf FILE1.beagle.gz FILE2.beagle.gz, but it was taking over an hour so I killed the process. Then I tried zgrep -c FILE1.beagle.gz FILE2.beagle.gz and got 0, which doesn't make sense.
Unfortunately, I never managed to run the python script. There was a required module (graph-tool) that I tried installing with conda, but the environment could never be solved during installation, so I gave up and went back to trying to use prune_graph.
I'd say that double the number is a bit too much...
Just compare the output files from both methods (no need to compare the beagles).
You can use grep, or just:
cat prune_graph.rust.out prune_graph.perl.out | sort | uniq -c
(sites present in both outputs will show a count of 2)
Ah, of course, that makes sense. Using
grep -Fxf dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5.id dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5_pg3.id
219,136 SNPs were identified as shared, which is nearly every single unlinked SNP identified with the perl script (219,149). It seems then that prune_graph is identifying an additional 250k unlinked SNPs on top of those.
I'm attaching the two output files: dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5.id.txt generated with the perl script, and dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_unlinked_20_0.5_pg3.id.txt generated with prune_graph, as well as the shared SNPs file shared_SNPs_2x_perl_vs_rust_output3.txt, in case that might be helpful.
Did you use column_3 < 20000 or column_3 <= 20000? If the former, can you try the latter (just so it is the most comparable to the perl script)?
The prune_graph output file seems to be the old one, since it has 808265 lines.
Do you think you could generate a smaller dataset reproducing this issue? Debugging a 12.7 GB file is kind of hard... 😄
I had used column_3 < 20000. I have just re-run the same dataset with column_3 <= 20000. Actually, I went back and checked the final remaining unlinked SNP numbers, and I realized I had made an error: the "488,972" final unlinked SNP count was from another dataset processed with the perl script. For the dataset in question, the results for each analysis were the following:
Remaining unlinked SNPs:
perl script (min_weight 0.5, max_kb_dist 20): 219,149
rust (column_3 < 20000 && column_7 > 0.5): 808,371
rust (column_3 < 20000 && column_7 >= 0.5): 808,266
rust (column_3 <= 20000 && column_7 >= 0.5): 808,261
So indeed, the slight tweaks to the rust analysis parameters did not markedly change the final SNP numbers; they remain about 4x greater than the perl script output.
I ran a similar set of analyses on another (larger) dataset, and the results were as follows:
Remaining unlinked SNPs:
perl script (min_weight 0.5, max_kb_dist 20): 488,972
rust (column_3 < 20000 && column_7 >= 0.5): 2,196,255
rust (column_3 <= 20000 && column_7 >= 0.5): 2,196,238
Again with this dataset, the perl script identified about 4x more SNPs as linked than rust did. Indeed, with the perl script, 11% and 9% of SNPs are retained as unlinked across the two datasets, while with rust, 42% and 41% are retained as unlinked. This at least seems somewhat consistent to me.
I am happy to move forward using rust for all of my linkage analyses. Indeed, it runs much more quickly, gives me a lot more SNPs to work with in the end, and works for all 4 of my datasets (whereas previously, my largest dataset could not be processed with the perl script). I guess I would just feel more confident if I understood why the perl script and rust were behaving so differently with my data.
I would be happy to provide a smaller dataset to reproduce this issue; I'm just not sure how to go about doing it. I produced the LD input files for the perl script and rust using beagle files generated in ANGSD. I believe that, e.g. with samtools, it should be possible to randomly subsample a bam file to make a smaller, random sample of the data. Would one approach be to randomly subsample all of the bam files I used to produce the beagle file for a given dataset in ANGSD, so that the smaller beagle file would then produce a smaller LD input file for perl and rust, etc.? Or is there perhaps a faster way?
Thanks so much for all of your help.
Thanks for looking so thoroughly into this. :smile:
It would be normal for the numbers to differ a bit (due to precision issues), but the difference should not be that big.
Can you send me a smaller file where I can reproduce the same pattern? A smaller dataset would be ideal but, if that is not possible, just take the first (e.g.) 10,000 lines from the beagle file and check whether you see the same pattern.
Thanks for the advice on producing the smaller file. Of my 4 datasets, I took the smallest one and extracted the first 10,000 lines from the beagle file as follows:
zcat dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X.beagle.gz | head -n 10000 > dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_first10000.beagle.gz
You can download that 'mini' beagle file from my Dropbox here (the file is 11 MB).
Let me know if that works, or if there's anything else I can provide that would help you reproduce this pattern. Thanks!
I tried running your file as:
ngsLD --geno dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_first10000.beagle.gz --pos dmawsoni2021_GL1_doMaf3_doMajorMinor1_doGlf2_minMaf0.05_SNPpval1e6_minInd29_NR_depth_DS2X_first10000.pos --n_ind 39 --n_sites 9999 --extend_out --outH out.ld
prune_graph --in out.ld --header --weight-field r2 --weight-filter "dist < 20000 && r2 >= 0.5" --out out.rust
tail -n +2 out.ld > out.ldH
prune_ngsLD.py --input out.ldH --max_dist 20000 --min_weight 0.5 --out out.python
perl prune_graph.pl --in_file out.ldH --max_kb_dist 20 --min_weight 0.5 | sort > out.perl
And got exactly the same result for all files:
$ md5sum out.*
b5493199f5729aee71884448e3891b6e out.perl
b5493199f5729aee71884448e3891b6e out.python
b5493199f5729aee71884448e3891b6e out.rust
Do you have a file where I can reproduce what you are seeing?
Hi,
thank you for this significantly faster pruning option.
Unfortunately, I have problems on some, but not all, files (initially I suspected it was larger files, or too many lines), and they usually end with a dumped core. This is a sample command:
~/scripts/prune_graph/target/release/prune_graph --in snps.ld.chr1.trunc --weight-field column_7 --weight-filter "column_3 <= 50000 && column_7 >= 0.1" --out chr1.unlinked.pos
This is the error:
thread 'main' has overflowed its stack
fatal runtime error: stack overflow
I use ngsLD files as input, either unedited or with only the relevant columns:
chr1:29495 chr1:29651 156 0.093005 -0.020154 1.000000 0.027559
chr1:30862 chr1:32349 1487 0.065533 -0.012921 0.999999 0.016464
chr1:25380 chr1:27944 2564 0.633290 0.075310 0.999990 0.724847
chr1:32386 chr1:32677 291 0.996459 0.100407 1.000000 1.000000
chr1:30862 chr1:32386 1524 0.891505 0.092751 1.000000 0.914650
chr1:25380 chr1:29403 4023 0.123450 0.063818 0.999946 0.167704
The system has 1 TB+ of RAM and 64 cores. Dependencies were installed with conda.
I have no real knowledge of Rust, so there's not much I can add here or try to fix myself.
Best regards
David