Hello @zjnolen,
Thanks once again for sharing your Python version. Filipe asked me to test it a bit against his Perl version. So, a few things:
So far so good. However, I have noticed that the retained nodes per se are not the same. In my case, I have 20,639 retained nodes, but 2,565 are only found in the Perl output and 2,568 are only found in the Python output. Thus, we suspect there is some difference regarding which nodes the scripts are retaining. Could you please confirm that you also observe this difference? And do you think you could maybe adjust your script to retain nodes in the same way as the Perl version?
Another thing is that I cannot run your script with the quoting=csv.QUOTE_NONE option. Thus, my output gets some quotes like this: "NC_046966.1"_5792. May I ask if you would know what could be happening? I have tried different options (e.g. quoting=csv.QUOTE_NONE, quotechar="", escapechar="\\", or quoting=3 from here), but I could not solve the issue.
Thanks a lot, George.
Hi George,
Thanks for testing and catching that! Looking at my outputs, I see the same thing: some sites are unique to each. What I imagine this relates to is how the scripts decide which node to prune when two nodes are of equal weight. I could imagine a situation where you end up with the same number of nodes at the end this way, but with some of them being different sites. Looking back at the Perl script, there is more going on with the sorting when it finds the heaviest node than in mine (which just finds the heaviest and prunes the first node the graph reports back). I've tried again, having it sort by name and then prune the first, but that just results in a different number of mismatching sites.
I'll have to do a bit of testing to figure out if this is really the cause. If the Perl script breaks ties using the index it assigns to nodes, matching it may not be possible, since graph-tool may not index nodes in the same way. But if it breaks ties by some metric that can be kept consistent between the two scripts, there should be a way. If that's not it at all, then I'll have to dig some more.
For the quoting, do you mean that you have to remove the quoting=csv.QUOTE_NONE in the pruned_df.to_csv line to get the code to run? Or is it still outputting with quotes even with that line there?
I'll be out of office for a week or so, so won't be able to get to this right away, but I'll update you once I've had a chance to look closer.
Best, Zach
Hey @g-pacheco,
I think this should now return the same sites as the original; it's been working on my data in tests so far. There were two things I missed from the original: one was that edge weights were converted to integers and truncated, and the other was that when there's a tie for the heaviest node, it prunes the first one alphabetically.
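To make that concrete, here's a minimal plain-Python sketch of the selection step as I understand it; the node names and weights are made up, and this isn't the actual script code:

```python
# Toy illustration of the tie-break, not the actual prune_ngsLD.py code.
# node_weights maps a node name to the summed weight of its incident edges
# (made-up values).
node_weights = {
    "chr1:1042": 3.71,
    "chr1:1187": 3.94,
    "chr1:1250": 3.52,
}

# Compare weights truncated to integers (mirroring the Perl behaviour described
# above); among nodes that tie, take the alphabetically first name.
heaviest = min(node_weights, key=lambda name: (-int(node_weights[name]), name))

print(heaviest)  # all three truncate to 3, so the tie-break picks "chr1:1042"
```

Without the int() truncation, "chr1:1187" would win outright, which is the kind of difference that was producing the mismatched sites.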
Best, Zach
Hej @zjnolen,
I hope you had a great break. Thanks a lot -- I will test it quickly with my data and I will let you know the results.
As for the ""
issue, I think it has to do with the loaded Python version because sometimes I get them and sometimes I do not. So I would not bother with that to be honest. I will try to find out which Python version works well so you know.
Best regards, George.
Hello @zjnolen,
Sorry for the time it took me to test this. Unfortunately, I am still getting a difference in the nodes prunedIN (the number is again the same).
Perl: 142 nodes
Python: 142 nodes
But there are 25 nodes that are not shared. I am attaching here my test file (1K SNPs). Could you please maybe try to replicate this behaviour?
BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.LD.gz
Here is how I ran your script:
python New.py --input BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.LD.gz --max_dist 100000 --min_weight 0.2 --output BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneIN --print_excl BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneOUT
Many thanks in advance, George.
Hey @g-pacheco,
Thanks for the testing and test data! Oddly, I could not replicate this. I ran it through both scripts on my end and it returned BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneIN with the same sites. For checking I used:
comm -3 <(sort BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneIN.python) <(sort BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneIN.perl)
This gave me no lines unique to either file. (I also switched over to : as the chr:pos separator, just to make comparing simpler and to better handle contigs with _ in the name, which might have been why it was quoting in some versions of Python when no escape character was given.) I did find that the exclude files were different, though, and realized that I should have been writing nodes to the dropped list before dropping them; otherwise a different node ends up getting written. I've fixed that, and now the exclude files match on my end as well. I've attached the outputs here. Are they similar to either the Python or Perl outputs you get?
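For reference, here's a small pandas sketch of the quoting behaviour I mean; the contig names are made up, and this isn't the script itself, just the to_csv behaviour:

```python
import csv
import pandas as pd

# Made-up example data; a contig name containing "_" is what triggers the quoting.
df = pd.DataFrame({"chr": ["contig_1", "NC_046966.1"], "pos": [5792, 6010]})

# With "_" as the separator, the default QUOTE_MINIMAL quotes any field that
# contains the separator, giving lines like "contig_1"_5792.
print(df.to_csv(sep="_", header=False, index=False))

# Requesting csv.QUOTE_NONE without an escapechar raises a csv.Error
# ("need to escape, but no escapechar set") on such fields:
# df.to_csv(sep="_", header=False, index=False, quoting=csv.QUOTE_NONE)

# Using ":" as the separator sidesteps the issue, since contig names are
# unlikely to contain ":".
print(df.to_csv(sep=":", header=False, index=False, quoting=csv.QUOTE_NONE))
```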
Thanks, Zach
Hello @zjnolen,
Thanks for the last tweaks. I still get a difference of 25 nodes. May I ask exactly how you ran your script? Maybe I am missing some flag?
Best, George.
Hi @g-pacheco,
I think I see what's different here: the prune_graph.pl script has had an update since the v1.1.1 release, and the two versions actually return different sites due to the implementation changes. Using the v1.1.1 version, I also get a difference of 25 sites between it and the Python script. However, if I use the most recent commit, I get a difference of 0 against the Python script. I based the Python one on the most recent commit, so it makes sense that it matches that one. Do you see the same difference with these script versions? These were the commands I used:
./prune_ngsLD.py --input BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.LD --max_dist 100000 --min_weight 0.2 --output BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneIN.python --print_excl BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneOUT.python
./prune_graph.pl --in_file BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.LD --max_kb_dist 100 --min_weight 0.2 --out BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneIN.perl --print_excl BSG_Lumpfish--AllSamples_Ind30_SNPs.1K.pruneOUT.perl
Best, Zach
@zjnolen thanks a lot for your contribution and patience testing it out! 😄
Hello @zjnolen,
I apologise for this confusion -- I did not know that there was a new version of the prune_graph.pl script. I can now confirm that the results are indeed identical (I also tried with 10K). I believe Filipe might be accepting your pull request soon.
Thanks once again for sharing your code.
Best regards, George.
Thanks for all the testing help, I think it's in much better shape now! I had not realized it had changed either until this last test 😅 I only noticed when I started getting differences between my recent local install and our cluster's v1.1.1 install as well.
Hey @fgvieira! I have finished the rewrite of your pruning method into graph-tool for Python, as we talked about in #25, and I think it will be helpful for some users with run time issues. I've tested it on one of my smaller datasets of ~60,000 nodes and it runs in a little under an hour, compared to the Perl script, which took 6 hours. I've also found that this run time difference can increase with the number of nodes. Overall, it seems to result in the same outcome, with the pruning returning the same sites.
Both used similar amounts of RAM, which unfortunately is still a bit high (usually approaching the size of the uncompressed ngsLD output), so I still tend to make beagle files in chunks, prune them, then merge them. This is also good for run time, since even though the speedup grows with the number of nodes, large runs can still take quite a long time.
I also tried to include most of the options found in your script in this one. I've tested them all, but not thoroughly, and some I'm sure could be coded better, but they at least seem to work. As far as the defaults go, I've been using them in a pipeline I've been working on for a while now with no problems.
Let me know what your thoughts on this are and if you have any suggestions or tests you'd like done!