broadinstitute / catch

A package for designing compact and comprehensive capture probe sets.
MIT License
72 stars 15 forks source link

Replicability Zika probes Metsky et al Nature Biotechnology #28

Closed joanmarticarreras closed 5 years ago

joanmarticarreras commented 5 years ago

Dear Heyden,

I am quite interested in both CATCH and your paper on Nature Biotech (https://www.nature.com/articles/s41587-018-0006-x#MOESM3) as a method to define custom probes for viruses. CATCH its exactly the tool we were looking for!

The first try outs that we did were with HHV-5, but as we were unsuccessful obtaining whole genome probe coverage (10-20 probes depending on the parameters), we decided to replicate the ZIKA probe design that is offered in GitHub, using the parameters from your paper on NB.

Your recommended parameters there are


Name | Taxonomic lineage(s) included | Input sequences | mismatches | cover-extension | probe-length | probe-stride | lcf-thres | island-of-exact-match | Include reverse complement probes | Add adapters | Expand Ns | Number of probes

zika | Flaviviridae,Flavivirus,Zika   virus | https://github.com/broadinstitute/catch/blob/24bf7d4ea924c1afbaa02f59c9a8c4c7051b1a7b/hybseldesign/datasets/data/zika.fasta | 2 | 10 | 75 | 75 | 75 | 30 | No | No | Yes | 1081

So we run design.py download:64320 -pl 75 -ps 75 --lcf-thres 75 --island-of-exact-match 30 -m 2 -e 10 --max-num-process 5 -o zika-probes.fasta --verbose --expand-n -o zika-probes.fasta

This only yielded 4898 of the 1081 probes listed in your paper. Even if more ZIKA genomes are available now, we don't think that the amount of probes should be reduced in half. Could guide us in to fully replicate it?

Additionally, we mapped the probes back to a reference for Zika (KY415991.1) and we replized that the tiling is quite uneven. Just to confirm, is this the behaviour should we expect? Is this tiling comming from taking into account the diveristy in the sample?

Looking forward to hear from you soon.

Joan

catch_zika_test.pdf

haydenm commented 5 years ago

Hi Joan,

It’s great that hear that CATCH is what you were looking for and that it may be helpful.

You’re running CATCH correctly. The reason for the difference you observed is because of the different numbers of input ZIKV genomes, as you suspected it might be. In the V-All design in the paper, we used these input sequences (i.e., in the row from Supplementary Table 1 that you copied). You can download that fasta file and use it as input: keep the same arguments you’re currently using, but replace download:64320 with a path to the file. When doing that, you should find that it designs 1,081 probes, as reported in Supplementary Table 1.

It is quite a difference to go from 1,081 probes to the ~4,900 that you designed with recent ZIKV sequences. But keep in mind that we designed V-All a while ago, using sequences through June 2016. At that time, there were very few ZIKV genomes available from the Americas. In the input we used, from the link above, there were just 82 ZIKV genomes, and many of these are older genomes of the African lineage. Despite this, the probe set worked well on samples from the recent epidemic, as we showed in the paper. As of now (via download:64320), there are 668 ZIKV genomes. So the diversity in the input is quite different. ZIKV is one of the starkest examples I can think of for a recent explosion in the number of sequences and observed diversity.

Regarding the coverage you see when mapping the probes back to a reference: yes, you should expect to see uneven coverage. In general, there will be more probes in more diverse regions of the genome.

However, ZIKV is a bit of an unusual case. Many (or maybe most) of the genomes have significant amounts of ambiguity (Ns) because it is hard to sequence. As a result, --expand-n could have a considerable effect for ZIKV. I’m not sure about this, but I suspect that the very deep pileups you see at some positions are due to ambiguity. You might not see this for other species that tend to have less ambiguity in their available genomes.

How were you running CATCH for the design for HHV-5? You should obtain whole genome coverage from the probes, as long as your input genomes are complete and you’re running it correctly.

Hayden

joanmarticarreras commented 5 years ago

Dear Hayden,

As for HCMV I run (as in the SI of the paper):

design.py download:10359  -pl 75 -ps 25 --lcf-thres 75 --island-of-exact-match 30 --expand-n -m 3 -e 40 -o HCMV_probes.fasta --verbose

We obtained 8479 probes, vs the 7704 that were obtained from the article. Recently, new genomes have been published, therefore I assume again that corresponds to the newly found diversity.

After mapping agains HCMV Merlin RefSeq reference, of those 8479 , only 5594 probes mapped correctly. Additionally, there are tilling gaps on the mapping. Would you clarify if those gaps (< 50 nt) are going to excluded from the capture, or those regions may be dragged down with the capture of neighbouring regions?

Thank you very much.

Joan

haydenm commented 5 years ago

Hi Joan,

That all looks good to me.

It is expected that some probes will not map to a RefSeq. The probes were designed across all the available sequences and therefore some may account for HCMV strain diversity that is divergent from the RefSeq you’re using. These probes may be too divergent from the RefSeq sequence to align to it.

It is also expected that there will be gaps because you’re using -e 40. This parameter, which we termed the cover extension, assumes that 40 nt of sequence is dragged down by capture on each side of the probe hybridization. Supplementary Figure 1b in our paper shows this visually. -e 40 could in theory allow gaps as large as 80 nt because you can have two probes, P1 and P2, with 80 nt between them where P1 pulls down 40 nt to the right of it and P2 pulls down 40 nt to the left of it.

We’ve found that this is a safe assumption with our libraries and capture protocol. If you have reason to think this will not be the case concerning capture — for example, if your library fragments are going to be very short — you can lower -e or set it to 0 to get rid of this feature entirely. Note that lowering -e will raise the number of probes.

Separately, I’d point out that you shouldn’t feel compelled to use exactly the parameters we list in Supplementary Table 1. In V-All, these were selected by CATCH (pool.py) with respect to 355 other viral species, to obtain an overall probe count of <350,000. Feel free to adjust them depending on your application.

Hayden