broadinstitute / catch

A package for designing compact and comprehensive capture probe sets.
MIT License

Is it possible to design high-throughput DNA probes for plant genomes using CATCH? #62

Open ShawnGao911101 opened 1 month ago

ShawnGao911101 commented 1 month ago

I have 10,000 SNPs on the maize genome and want to design DNA probes for them.

Is there a way that I can use CATCH to do that?

Thank you!

haydenm commented 4 weeks ago

CATCH is geared toward designing probes for whole-genome capture of diverse taxa. While it would be possible in theory to use it for designing probes for SNP genotyping of maize, that's not really what CATCH was meant for and there are probably other tools better suited for that. For example, if you're storing the SNPs in a file format like VCF or BED, CATCH won't accept a format like that as input. If you really wanted to use CATCH, you could start by converting the file with SNPs (e.g., using bedtools) into a FASTA file in which each sequence is the region of the maize genome around a SNP.
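For illustration, the conversion step could look roughly like the sketch below. It is not part of CATCH or bedtools; the function names `snp_windows` and `write_fasta`, the `flank` size, and the in-memory `genome` dict are all assumptions made for the example. It takes 1-based SNP positions (as in VCF) and writes one FASTA record per SNP covering the surrounding region.

```python
import io


def snp_windows(genome, snps, flank=100):
    """Yield (name, sequence) for the region around each SNP.

    genome: dict mapping chromosome name -> sequence string (assumed in memory)
    snps:   iterable of (chrom, pos) with 1-based positions, as in VCF
    flank:  bases kept on each side of the SNP (hypothetical default)
    """
    for chrom, pos in snps:
        seq = genome[chrom]
        start = max(0, pos - 1 - flank)   # 0-based, inclusive; clipped at the start
        end = min(len(seq), pos + flank)  # 0-based, exclusive; clipped at the end
        # Name the record by its 1-based, inclusive coordinates.
        yield f"{chrom}:{start + 1}-{end}", seq[start:end]


def write_fasta(records, handle, width=60):
    """Write (name, sequence) pairs to a file-like object in FASTA format."""
    for name, seq in records:
        handle.write(f">{name}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


if __name__ == "__main__":
    # Toy data standing in for a real genome and SNP list.
    genome = {"chr1": "ACGTACGTAC"}
    out = io.StringIO()
    write_fasta(snp_windows(genome, [("chr1", 5)], flank=2), out)
    print(out.getvalue(), end="")
```

For a genome the size of maize, an indexed reader or bedtools itself would be more practical than loading whole chromosomes into a dict; the sketch only shows the coordinate handling.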

ShawnGao911101 commented 3 weeks ago

Thank you for your reply.

As you suggested, I would need to extract the 10,000 maize sequences around the SNPs and save them in a single FASTA file.

design.py would then treat them as 10,000 genomes and design probes for each of them.

I ran a test with a four-sequence input FASTA, and the output file looks like this:

>probe_bbb5017153
CGTGCACGACCATCGAGACGTCGCGGAAACTCGTCGTTTTTGTCGTTCGGGCCACTTTCATGGGCTATAGCACAC
>probe_073d252a5b
GCACGGCTATCGAGACTTTGCAAAAACTCGTTGTTTTTGTCGTTTTGGGCCAGTTTTGTGGGCTATAACACACTG
>probe_b4f78a9871
ACGACTACTGATACGTCGCAAAAACATATTATTTTGTCGTTTCGGGTGAGTTTCGTGGCTATAGCGCACTGTTTT
>probe_d13eaa7b6c
CTATCGAGATGTTACAAAAACTGGTTGTTTTGTCGTTTTAGGCCAGTTTCGTGGGCTATAACGCACCGTTTTGGG

I'm not sure whether the probe IDs follow the order of the input sequences. The probe IDs seem to have no connection to the input sequence IDs.

Also, if I get 10,000 probes from design.py as output, can I use all of them as a panel, or do I need to run pool.py to optimize it?

I'm really looking forward to your reply.

Many thanks.

haydenm commented 2 weeks ago

That's right—the probe ID has nothing to do with the input sequence ID (and is just a hash of the probe sequence) because a probe may not necessarily derive from a single input sequence. You can just use the output of design.py for your probes without needing to run pool.py. Nevertheless, to repeat what I wrote in my earlier message, I suggest you look at other software for your problem because it doesn't sound like CATCH is geared for your particular design need; unless you customize hybridization parameters (e.g., with --custom-hybridization-fn) you're unlikely to get probes that are any different from a naive design strategy that tiles along the regions around your SNPs.
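To make the comparison concrete, a naive tiling design of the kind mentioned above might look like the following sketch. This is not CATCH's algorithm or code; the function names and the `probe_len` and `stride` values are assumptions chosen for illustration. It places a probe at every `stride` bases along each region and drops exact duplicate probes.

```python
def tile(seq, probe_len=75, stride=25):
    """Return probes of length probe_len starting every stride bases.

    probe_len and stride are hypothetical values, not CATCH defaults.
    """
    if len(seq) <= probe_len:
        return [seq]  # region shorter than a probe: keep it whole
    last_start = len(seq) - probe_len
    starts = list(range(0, last_start + 1, stride))
    # Add a final probe flush with the end so the last bases are covered.
    if starts[-1] != last_start:
        starts.append(last_start)
    return [seq[s:s + probe_len] for s in starts]


def tile_all(records, probe_len=75, stride=25):
    """Tile every (name, sequence) record and drop exact duplicate probes."""
    seen = set()
    probes = []
    for _name, seq in records:
        for probe in tile(seq, probe_len, stride):
            if probe not in seen:
                seen.add(probe)
                probes.append(probe)
    return probes


if __name__ == "__main__":
    # Toy regions; real input would be the sequences around each SNP.
    regions = [("r1", "ABCDEFGHIJK"), ("r2", "ABCDEFGHIJ")]
    for probe in tile_all(regions, probe_len=4, stride=3):
        print(probe)
```

Because each probe here comes from exactly one position in one region, it is easy to record which SNP a probe belongs to, which may matter more for a genotyping panel than the compactness CATCH optimizes for.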