faircloth-lab / phyluce

software for UCE (and general) phylogenomics
http://phyluce.readthedocs.org/

Question about using phyluce on whole-genome shotgun reads #204

Closed MareikeJaniak closed 3 years ago

MareikeJaniak commented 3 years ago

Hi!

Thanks for making this resource available and having such detailed tutorials!

I was wondering if you might have some suggestions or advice for using phyluce for identifying UCEs from whole-genome shotgun reads (not enriched for UCEs). The samples are from primates and were sequenced to ~30x, so the normal assembly step would likely take a very long time.

Would you recommend subsampling the reads before assembly to speed this step up? If so, what depth would you recommend for still being able to retrieve a large number of UCEs? I subsampled some of the reads to retrieve mitochondrial genomes with MitoFinder and ran the resulting contigs (assembled with metaspades during the MitoFinder pipeline) through phyluce, but I only retrieved 300-400 UCEs per sample, probably because I downsampled quite a bit for MitoFinder (~3.5 Million PE reads).
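If subsampling is the route taken, the one detail to get right is keeping R1 and R2 in sync so pairing information survives. A minimal Python sketch of seeded, pair-aware subsampling (reads are represented as simple `(id, sequence)` tuples for illustration; in practice a tool such as `seqtk sample` run with the same seed on both files accomplishes the same thing):

```python
import random

def subsample_pairs(r1, r2, fraction, seed=42):
    """Keep a random fraction of read pairs, keeping R1/R2 in sync.

    r1 and r2 are parallel lists of (read_id, sequence) tuples.
    """
    assert len(r1) == len(r2), "paired files must have equal read counts"
    rng = random.Random(seed)  # fixed seed -> reproducible subsample
    keep1, keep2 = [], []
    for a, b in zip(r1, r2):
        # one draw per *pair*, so mates are kept or dropped together
        if rng.random() < fraction:
            keep1.append(a)
            keep2.append(b)
    return keep1, keep2

# toy example: 1000 pairs, keep ~10%
r1 = [(f"read{i}/1", "ACGT") for i in range(1000)]
r2 = [(f"read{i}/2", "TGCA") for i in range(1000)]
s1, s2 = subsample_pairs(r1, r2, 0.10)
```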

Or am I approaching this from the wrong angle and missing an easier solution? Would mapping to a reference be a better way to go?

I'd appreciate any suggestions you might have! Thanks!

Best, Mareike

mateusf commented 3 years ago

Hi Mareike,

You can try itero (https://itero.readthedocs.io/en/latest/) with one of the UCE probe sets that are available at https://www.ultraconserved.org/.

All the best,

Mateus


brantfaircloth commented 3 years ago

Hi Mareike and Mateus,

Hope you are both doing well! And, thanks Mareike for the kind words - I am glad you found the tutorials helpful.

Basically, I would do something like what Mateus suggests - harvest UCE loci from a somewhat closely related organism (that shouldn't be hard to find, since you are working w/ primate data), then "build" your UCE loci from the sequence data you have using a reference-based assembly approach like aTRAM or itero.

When harvesting the loci to use as a reference, they don't need to be from the same species (that's not a huge concern), and I would probably harvest loci with 500 bp of flanking sequence (this gives you 500 in both directions from the "center" of the UCE). You could go to ±750 bp, but anything larger than that is probably overkill.
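The flank-harvesting idea above amounts to slicing a window around the center of each probe match, clamped at the scaffold ends. A hedged Python sketch for illustration only (this is not phyluce's actual harvesting code; in the real pipeline the phyluce probe-slicing tools handle this step):

```python
def slice_locus(scaffold_seq, hit_start, hit_end, flank=500):
    """Return the locus sequence +/- `flank` bp around the center of a
    probe hit, clamped so the slice never runs off the scaffold."""
    center = (hit_start + hit_end) // 2
    lo = max(0, center - flank)
    hi = min(len(scaffold_seq), center + flank)
    return scaffold_seq[lo:hi]

# toy scaffold of 10 kb; probe hit at 4000-4120, flank=500
scaffold = "A" * 10_000
locus = slice_locus(scaffold, 4000, 4120, flank=500)
# a hit near the scaffold edge yields a shorter, clamped slice
edge_locus = slice_locus(scaffold, 0, 120, flank=500)
```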

Once you assemble, you can run those loci through the phyluce pipeline like usual.

MareikeJaniak commented 3 years ago

Hi Mateus, hi Brant,

Thanks for the suggestions!

I don't know why I hadn't thought of aTRAM before, that's a great idea and I already have aTRAM libraries built for some of my samples. I appreciate the quick responses!

Best, Mareike

MareikeJaniak commented 3 years ago

Hi Brant,

I just wanted to give an update on this, in case it's helpful for anyone else in the same situation:

Assembling the loci from whole genome shotgun reads with aTRAM worked very well after some troubleshooting. Some tips for streamlining the aTRAM assembly process for UCEs:

There was a lot of variation in how long each locus took to assemble, which caused some issues initially: while many loci would assemble in <10 minutes, others would take 4+ hours. A few things sped the process up dramatically:

- Reducing the number of blast hits with the `--max-target-seqs` option.
- Limiting assembly to 3 iterations with the `--iterations` option.
- For very large aTRAM libraries, using the `--fraction` option to use only a portion of the full library.
- Parallelizing the assemblies with GNU parallel.

I filtered the aTRAM assemblies to only keep the top contig for each locus and then ran that through the phyluce pipeline.
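The filtering step above can be sketched as follows. This is a hedged illustration that takes "top contig" to mean the longest one; aTRAM contig headers may also carry a coverage value, which would be a reasonable alternative ranking key:

```python
def top_contig(fasta_text):
    """Return (header, sequence) of the longest contig in a FASTA string."""
    records, header, seq = [], None, []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:].strip(), []
        else:
            seq.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq)))
    # "top" here = longest; swap the key for a coverage-based ranking
    return max(records, key=lambda r: len(r[1]))

fasta = ">contig1\nACGT\n>contig2\nACGTACGTACGT\n>contig3\nACGTACGT\n"
best = top_contig(fasta)
```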

Thanks again for your help!

Best, Mareike

liugang9988 commented 3 years ago

Hi Mareike,

I ran into trouble when using aTRAM to assemble UCE loci from Illumina PE reads. The `atram_preprocessor.py` step worked successfully:

```
atram_preprocessor.py \
  --blast-db=./data_base/Afaf1s \
  --end-1=./Afaf1s.R1.fastq \
  --end-2=./Afaf1s.R2.fastq
```

However, it failed when assembling loci with `atram.py`, reporting:

```
(aTRAM) [llxss@login04 aTRAM]$ ./atram.py \
  --blast-db=./data_base/Afaf1s \
  --query=./uce_probe.fasta \
  --assembler=spades \
  --output-prefix=./out_put/Afaf1s \
  --log-file=./Test.log \
  --temp-dir=./tmp \
  --keep-temp-dir
2020-11-17 10:45:51 INFO : ################################################################################
2020-11-17 10:45:51 INFO : aTRAM version: v2.3.4
2020-11-17 10:45:51 INFO : Python version: 3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0]
2020-11-17 10:45:51 INFO : ./atram.py --blast-db=./data_base/Afaf1s --query=./uce_probe.fasta --assembler=spades --output-prefix=./out_put/Afaf1s --log-file=./Test.log --temp-dir=./tmp --keep-temp-dir
2020-11-17 10:45:51 INFO : aTRAM blast DB = "./data_base/Afaf1s", query = "uce_probe.fasta", iteration 1
2020-11-17 10:45:51 INFO : Blasting query against shards: iteration 1
2020-11-17 10:45:57 INFO : All 1 blast results completed
2020-11-17 10:45:57 INFO : 1 blast hits in iteration 1
2020-11-17 10:45:57 INFO : Writing assembler input files: iteration 1
2020-11-17 10:45:57 INFO : Assembling shards with spades: iteration 1
2020-11-17 10:46:12 ERROR: Exception: Command 'spades.py --only-assembler --threads 10 --memory 37 --cov-cutoff off -o ./tmp/atram_a2ay1lql/Afaf1s_uce_probe.fasta_01_yklcj4rf/spades --pe1-1 '/n/satch01/UCE_sub/aTRAM/tmp/atram_a2ay1lql/Afaf1s_uce_probe.fasta_01_yklcj4rf/paired_1.fasta' --pe1-2 '/n/satch01/UCE_sub/aTRAM/tmp/atram_a2ay1lql/Afaf1s_uce_probe.fasta_01_yklcj4rf/paired_2.fasta'' returned non-zero exit status 1.
2020-11-17 10:46:12 ERROR: Exception: [Errno 2] No such file or directory: './tmp/atram_a2ay1lql/Afaf1s_uce_probe.fasta_01_yklcj4rf/spades/contigs.fasta'
```

MareikeJaniak commented 3 years ago

Hi!

I actually got the same error when I tried to use SPAdes as my assembler. I didn't try to figure out what was going on; I just ended up using Velvet as the assembler instead. ABySS might work too. Note that for Velvet I added a coverage cutoff, which improved the assemblies dramatically. (Older versions of aTRAM had that option built in for Velvet, but the latest one didn't, so I added it to the script myself.)

Good luck! By the way, Julia over on the aTRAM GitHub page is very responsive and helpful; she might be able to figure out what's going on with SPAdes.

Best, Mareike