broadinstitute / catch

A package for designing compact and comprehensive capture probe sets.

catch on large input #40

Closed boucherufl closed 4 years ago

boucherufl commented 4 years ago

I am trying to run catch on a large number of sequences. I have read the documentation and am trying to use the --cluster-and-design-separately option, but I continually get the following error:

design.py: error: argument --cluster-and-design-separately: 0.85 is an invalid average nucleotide dissimilarity

I also tried 85, rather than 0.85, and got the same error. What type of value should this parameter be?

Best, Christina

boucherufl commented 4 years ago

Hi all, I have followed the instructions from Issue #32, but catch has now been running for over 7 days with no progress. Can someone give me some feedback or additional advice?

haydenm commented 4 years ago

Hi Christina,

Sorry for the delay.

The value for --cluster-and-design-separately represents a nucleotide dissimilarity, not a similarity, and thus should be much less than 0.85. At 0.85 there would generally be only one cluster containing all of the sequences, because any two sequences will be at most 85% dissimilar (equivalently, have an ANI of at least 15%) simply by chance. Values should usually be around 0.1-0.3. Running design.py --help gives more detail on the argument; as it notes, design.py does not permit values >0.5, to avoid misunderstanding or misuse of the argument. The help message should also describe how the value affects runtime.
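For example, a valid invocation under this guidance might look like the following (the input and output file names are placeholders; the other flags mirror those used elsewhere in this thread):

python design.py -pl 120 -ps 120 -m 3 -e 20 --cluster-and-design-separately 0.2 --verbose -o my_probes my_genomes.fasta.gz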

If you are now using smaller values but it is still taking a long time, it would help if you could summarize your input data: e.g., the number of sequences, their average length, and anything else that might be relevant. When you say it does not make progress, where does it hang? For cases like this, I suggest running CATCH with --verbose so that it outputs a more detailed log of its progress.

Hayden

boucherufl commented 4 years ago

Hi Hayden,

Yes, I did try --help. I am actually from the computer science department.

We took the original CATCH viral dataset, added a single small genome of interest, and are trying to regenerate the baits. We cannot use the original CATCH baits because they are 85 bp long; we are using Agilent, which requires baits that are 120 bp long.

I have run the method both with and without --verbose. I tried it without on the suspicion that disk I/O was slowing it down.

This is what I originally tried:

python design.py -pl 120 -ps 120 -m 3 -e 20 --small-seq-skip 120 --write-analysis-to-tsv TSV.out -o virome_probes ../../catch/catch/datasets/data/all_viral_genomes.fasta.gz

It ran for 5 days without completing.

I found this post: https://github.com/broadinstitute/catch/issues/32 and followed those instructions, which gave me the following command:

nohup python ../catch/bin/design.py -pl 120 -ps 120 -m 3 -e 20 --small-seq-skip 120 --cluster-and-design-separately 0.3 --cluster-from-fragments 50000 --filter-with-lsh-hamming 2 --verbose --write-analysis-to-tsv Boucher-TSV.out -o virome_probes /home/noyes046/shared/projects/covid19_rapidgrant/E_test/.catch/catch/datasets/data/all_viral_genomes.gz &

After running for over a week, I killed it. The last lines of the nohup output are:

2020-07-22 13:07:49,917 - catch.utils.seq_io [INFO] Reading fasta file /home/noyes046/shared/projects/covid19_rapidgrant/E_test/catch/catch/datasets/data/all_viral_genomes.fasta.gz
2020-07-22 13:08:54,261 - catch.filter.probe_designer [INFO] Clustering 426011 sequences using MinHash signatures, at an average nucleotide dissimilarity threshold of 0.100000
2020-07-22 13:08:54,262 - catch.utils.cluster [INFO] Producing signatures of 426011 sequences
2020-07-22 13:25:31,044 - catch.utils.cluster [INFO] Creating condensed distance matrix of 426011 sequences

I am rerunning it on a more powerful server (although the original server should be more than adequate) and increasing --cluster-and-design-separately to 0.3.

Can you give me some insight into what is going on?

Thanks. Christina

haydenm commented 4 years ago

Hi Christina,

When used on its own, design.py was not built to scale across hundreds of thousands of divergent genomes. It might work if they are all related to each other, but it is going to be slow if they represent hundreds of species. I think this is what is causing your problem.
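For a rough sense of the scale involved (a back-of-envelope aside, not something CATCH itself reports): the condensed distance matrix mentioned in the log above holds one entry per pair of the 426,011 sequences, which works out to roughly 9 x 10^10 pairwise distances.

python -c "n = 426011; print(n * (n - 1) // 2)"   # about 9.1e10 pairs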

What we usually do is run design.py on each individual species with varying parameters, and then run pool.py across the species. Even this approach can be computationally expensive when applied across hundreds of viral species with variable parameter settings; it is unfortunate that a full run is needed just to increase the probe length. Since you have already identified parameter settings (-m 3 -e 20), a simpler solution, if you are flexible on the final probe count, might be to run design.py separately for the genomes of each species or genus and then concatenate the outputs (see the sketch below). This could create some redundancy where species or genera are similar to one another, but that is not necessarily a bad thing. Alternatively, you could take the parameters we identified for each dataset (Supplementary Table 1 of our paper, for the original V-All probe set) and use them in each run of design.py.
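As a minimal sketch of that per-species route (the per_species/ directory layout and file names are hypothetical; the design.py flags are the ones already used in this thread, and the outputs are concatenated exactly as suggested above):

# assumes one FASTA per species or genus in per_species/ (hypothetical layout)
for f in per_species/*.fasta.gz; do
    name=$(basename "$f" .fasta.gz)
    python design.py -pl 120 -ps 120 -m 3 -e 20 --verbose -o probes_"$name" "$f"
done
# concatenate the per-species probe files into a single combined set
cat probes_* > virome_probes_combined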

Since you mentioned using the original CATCH viral dataset, note that there are updated probe designs and NCBI genome data compared to what was used for that first V-All design, which dates back to June 2016. There are updated datasets from Oct 2018, but the newer approach I suggest is to specify taxonomic IDs to download (described here).

Hayden