bcgsc / physlr

:chains: Construct a Physical Map from Linked Reads
GNU General Public License v3.0
18 stars 8 forks source link

Parameters for f1chr4 example? #169

Closed evolgenomics closed 2 years ago

evolgenomics commented 3 years ago

I'm following the f1chr4 example to replicate the results.

But I haven't been able to get sensible results from the f1chr4 dataset. In fact I couldn't get any edges to be retained. The main issue seems to be that there are simply not that many molecules - and so all the vertices have been pruned. See output below.

This isn't the expected behaviour, right? If so, what were the parameters you were using to get this to work? I'm exploring them now, but some guidance would be helpful.

Thank you, Frank

../physlr/bin/physlr-make physical-map ref=fly lr=f1chr4 draft=f1.supernova n=25 minimizer_multiplicity=x m=25 n=25 N=1000 k=40

The mode k-mer coverage is 276. The repeat k-mer coverage is 828. command time -v -o f1chr4_k40.rep.time nthits -t16 -k40 -c828 -p f1chr4 f1chr4.fq.gz Reapeat profile estimated using ntCard in (sec): 22.3568

Errors k-mer coverage: 97 Median k-mer coverage: 276 Repeat k-mer coverage: 828 Approximate# of distinct k-mers: 55646439 Approximate# of solid k-mers: 13077 Total time for computing repeat content in (sec): 69.5050

command time -v -o f1chr4.k40.bf.time /fml/chones/local/physlr/src/physlr-makebf -t16 -k40 -b10000000000 -v -o f1chr4.k40.bf f1chr4_k40.rep Collecting Kmers to insert into bloom filter Made Bloom filter with: kmer size = 40 Bloom filter size = 80000000000

Inserting 0 kmers into Bloom filterUsing 16 threads. Bloom filter stats:

counters = 80000000000

size (B) = 10000000000

popcount = 0 FPR = 0% Writing a 10000000000 byte filter to f1chr4.k40.bf on disk.

command time -v -o fly/fly.k40-w32.physlr.tsv.time /fml/chones/local/physlr/src/physlr-indexlr -t16 -k40 -w32 -r f1chr4.k40.bf --pos -o fly/fly.k40-w32.physlr.tsv fly/fly.fa Loading repeat Bloom filter from f1chr4.k40.bf Finished loading repeat Bloom filter

pigz -p16 -cd f1chr4.fq.gz | command time -v -o f1chr4.k40-w32.physlr.tsv.gz.time /fml/chones/local/physlr/src/physlr-indexlr -t16 -k40 -w32 -r f1chr4.k40.bf - | pigz -p16 >f1chr4.k40-w32.physlr.tsv.gz Loading repeat Bloom filter from f1chr4.k40.bf Finished loading repeat Bloom filter time user=10.17s system=1.87s elapsed=22.60s cpu=53% memory=2 job= time user=178.20s system=16.09s elapsed=26.29s cpu=738% memory=10139 job= time user=75.27s system=1.42s elapsed=26.31s cpu=291% memory=7 job=

pigz -p16 -cd f1chr4.k40-w32.physlr.tsv.gz | command time -v -o f1chr4.k40-w32.n100-5000.physlr.tsv.gz.time /fml/chones/local/physlr/src/physlr-filter-barcodes -n100 -N5000 - | pigz -p16 >f1chr4.k40-w32.n100-5000.physlr.tsv.gz Time at readMxs (ms): 15886 Time at countMxs (ms): 18067 Counted 2972750 minimizers. Time at removeSingletonMxs (ms): 22939 Removed 2585978 minimizers that occur once of 2972750 (87.0%) There are 77715 barcodes. Time at filterbarcodes (ms): 23010 Discarded 58504 barcodes with too few minimizers of 77715 (75.3%) Discarded 0 barcodes with too many minimizers of 77715 (0.0%) Wrote 19211 barcodes Time at writeMxs (ms): 27602 time user=4.79s system=1.07s elapsed=15.89s cpu=36% memory=3 job= time user=25.58s system=2.23s elapsed=27.82s cpu=99% memory=1494 job= time user=56.67s system=0.71s elapsed=27.85s cpu=206% memory=12 job=

pigz -p16 -cd f1chr4.k40-w32.n100-5000.physlr.tsv.gz | command time -v -o f1chr4.k40-w32.n100-5000.c2-x.physlr.tsv.time /fml/chones/local/physlr/src/physlr-filter-bxmx -o f1chr4.k40-w32.n100-5000.c2-x.physlr.tsv - Time after reading minimizers (ms): 5204 Read 19211 barcodes.

Time after counting minimizers (ms): 6276 Counted 371588 minimizers.

Time after removing singleton minimizers (ms): 7843 Removed 13298 minimizers that occur once of 358290 (3.7%)

Time after filtering barcodes (ms): 7844 Discarded 0 barcodes with too few minimizers of 19211 (0.0%) Discarded 0 barcodes with too many minimizers of 19211 (0.0%) There are 19211 barcodes remaining

Time after counting minimizers (ms): 8864 Counted 358290 minimizers.

Time after removing singleton minimizers (ms): 10413 Removed 0 minimizers that occur once of 358290 (0.0%)

Time after removing repetitive minimizers (ms): 12780 Minimizer frequency: Q1=2.0 Q2=3.0 Q3=11.0 C= 24 Removed 81559 most frequent minimizers of 358290 Removed 13 empty barcodes of 19211 There are 19198 barcodes remaining. Time after writing minimizers (ms): 13853 Wrote 19198 barcodes.

time user=2.88s system=0.42s elapsed=5.21s cpu=63% memory=2 job= time user=12.92s system=1.07s elapsed=13.99s cpu=99% memory=1232 job=

command time -v -o f1chr4.k40-w32.n100-5000.c2-x.physlr.overlap.tsv.time /fml/chones/local/physlr/src/physlr-overlap -t16 -m10 f1chr4.k40-w32.n100-5000.c2-x.physlr.tsv >f1chr4.k40-w32.n100-5000.c2-x.physlr.overlap.tsv Loading file f1chr4.k40-w32.n100-5000.c2-x.physlr.tsv Finished constructing barcodeToMinimizer and minimizerToBarcode in sec: 0.896744 Memory usage: 0.14225GB Populating Overlaps Total Minimizers: 276731 Total Barcodes: 19198 Finished computing overlaps in sec: 0.133628 Memory usage: 1.19699GB Total number of unfiltered edges: 2804 Total number of filtered edges: 1285681

command time -v -o f1chr4.k40-w32.n100-5000.c2-x.physlr.overlap.m25.tsv.time env PYTHONPATH=/fml/chones/local/physlr python /fml/chones/local/physlr/bin/physlr filter-overlap --minimizer-overlap 25 -V1 f1chr4.k40-w32.n100-5000.c2-x.physlr.overlap.tsv >f1chr4.k40-w32.n100-5000.c2-x.physlr.overlap.m25.tsv 0 Processing Nodes 0 Processing Edges 0 Sorting Edges 0 Filtering Edges 0 Lower Threshold 9 command time -v -o f1chr4.k40-w32.n100-5000.c2-x.physlr.overlap.m25.mol.tsv.time env PYTHONPATH=/fml/chones/local/physlr python /fml/chones/local/physlr/bin/physlr molecules -V1 -t4 --separation-strategy=distributed+sqcosbin f1chr4.k40-w32.n100-5000.c2-x.physlr.overlap.m25.tsv >f1chr4.k40-w32.n100-5000.c2-x.physlr.overlap.m25.mol.tsv 0 Reading f1chr4.k40-w32.n100-5000.c2-x.physlr.overlap.m25.tsv 0 Read f1chr4.k40-w32.n100-5000.c2-x.physlr.overlap.m25.tsv 0 Separating barcodes into molecules using the following algorithm(s): distributed + sqcosbin 3 Identified molecules 3 Identified 281 molecules in 19198 barcodes. 0.01 mean molecules per barcode 3 Separated molecules 3 Removed 6 isolated vertices. 3 Wrote graph command time -v -o f1chr4.k40-w32.n100-5000.c2-x.physlr.overlap.m25.mol.backbone.tsv.time env PYTHONPATH=/fml/chones/local/physlr python /fml/chones/local/physlr/bin/physlr backbone-graph --prune-branches=10 --prune-bridges=10 --prune-junctions=200 -s0 -V1 f1chr4.k40-w32.n100-5000.c2-x.physlr.overlap.m25.mol.tsv >f1chr4.k40-w32.n100-5000.c2-x.physlr.overlap.m25.mol.backbone.tsv 0 Reading f1chr4.k40-w32.n100-5000.c2-x.physlr.overlap.m25.mol.tsv 0 Read f1chr4.k40-w32.n100-5000.c2-x.physlr.overlap.m25.mol.tsv 0 Determining the backbone-induced subgraph. 0 Determined the maximum spanning tree. 0 Pruned 230 vertices of 275 (83.64%) in 1 iterations. 0 Removed 0 vertices in 0 bridges shorter than 10 vertices. 0 Removed 0 bridges in 0 iterations. 0 Determined the maximum spanning tree. 0 Pruned 230 vertices of 275 (83.64%) in 1 iterations. 0 Pruned 0 junctions in the backbone 0 Assembled 0 molecules in 0 paths. Traceback (most recent call last): File "/fml/chones/local/physlr/bin/physlr", line 19, in physlr.physlr.main() File "/fml/chones/local/physlr/physlr/physlr.py", line 2873, in main Physlr().main() File "/fml/chones/local/physlr/physlr/physlr.py", line 2869, in main getattr(Physlr, method_name)(self) File "/fml/chones/local/physlr/physlr/physlr.py", line 1250, in physlr_backbone_graph self.write_graph(subgraph, sys.stdout, self.args.graph_format) File "/fml/chones/local/physlr/physlr/physlr.py", line 174, in write_graph Physlr.write_tsv(g, fout) File "/fml/chones/local/physlr/physlr/physlr.py", line 154, in write_tsv if "mol" in next(iter(g.nodes.values())): StopIteration ../physlr/bin/physlr-make:950: recipe for target 'f1chr4.k40-w32.n100-5000.c2-x.physlr.overlap.m25.mol.backbone.tsv' failed make: [f1chr4.k40-w32.n100-5000.c2-x.physlr.overlap.m25.mol.backbone.tsv] Error 1 make: Deleting file 'f1chr4.k40-w32.n100-5000.c2-x.physlr.overlap.m25.mol.backbone.tsv'

aafshinfard commented 3 years ago

Hello Frank @evolgenomics , Thank you for your interest in testing our tool, So there have been changes to Physlr after we included the f1chr4 example. I will be able to fully try the example myself at the end of this week and troubleshoot, but my best guess is that you'll be able to fix things by changing the value for m. Try larger values for the m parameter as we usually do. The last experiment I had with this dataset was with an older metric (n) instead of m but I would say m=25 is too low and you may test m greater than 50 or even 70... just give different values tries. You MAY also need to change the algorithm for separating molecules. I would be happy to hear back about your experience. If you're not in a rush I will test them in a week hopefully.

More details on m: So m determines what percent of the edges (~roughly) will remain after filtering. We sort edges based on their weight and filter until we have less than m% of the edges left... sometimes, when the edge values saturate at around a certain value, you see that the amount of edges filtered is much more than the m value you chose... You chose m=25 and I am guessing that many edges in that range were of the same edge value so they all got filtered at once and the remained edges went under 1-2%.

skagawa2 commented 2 years ago

Hello! I know it has been awhile, but do you happen to have any update for this? I want to test to see if my installation is correct.

aafshinfard commented 2 years ago

Hello @skagawa2, sorry for the delay in response, Physlr will be shortly available via conda for an easier installation. We are also working on preparing some demo examples based on the latest version of Physlr.

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your interest in Physlr!

aafshinfard commented 2 years ago

@evolgenomics @skagawa2 Physlr is now available thru Conda, please see https://github.com/bcgsc/physlr#readme for details