biocore / emp

Code repository of the Earth Microbiome Project.
http://www.earthmicrobiome.org
BSD 3-Clause "New" or "Revised" License

Question: fastaq for release 1 #102

Open galud27 opened 6 years ago

galud27 commented 6 years ago

Hi, thanks for making all the code and data analysis for EMP Release 1 available. It's great! I have a question: I am very interested in performing a smaller study on these data, but using a different pipeline for the microbial community analysis. I was able to get fastaq files for all the studies, but the data are only available as single reads in ENA. Are there any studies with paired-end fastaq sequences?

Thanks

cuttlefishh commented 6 years ago

Hi! Thanks for your message. We hope the resource is useful to you.

For EMP Release 1, some of the studies did not have Read 2 data, and for those that did, we did not use it. This is because the amplicon size is ~253 bp, and only the more recent studies, with read lengths of 150-151 bp, have reads long enough to overlap and merge. Read 2 tends to have a higher error rate than Read 1, and merging the Read 1 and Read 2 sequences can also introduce uncertainty, which we wanted to avoid.
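
To make the read-length argument concrete, here is a rough sketch of the arithmetic. It assumes untrimmed reads of equal length that start at opposite ends of the amplicon. The function name and the 100 bp example length are illustrative assumptions, not something taken from the EMP protocol.

```python
def expected_overlap(amplicon_len: int, read_len: int) -> int:
    """Number of bases covered by both R1 and R2.

    Assumes both reads have read_len bases and start at opposite ends
    of the amplicon. A negative value means the reads leave a gap in
    the middle and cannot be merged.
    """
    return 2 * read_len - amplicon_len


AMPLICON_LEN = 253  # approximate amplicon size quoted above

# 150 and 151 are the read lengths mentioned above;
# 100 is a hypothetical shorter read length used for contrast.
for read_len in (100, 150, 151):
    overlap = expected_overlap(AMPLICON_LEN, read_len)
    status = "overlap" if overlap > 0 else "gap"
    print(f"{read_len} bp reads: {status} of {abs(overlap)} bp")
```

With 150 bp reads the two reads together span 300 bp, so they share roughly 47 bp of a 253 bp amplicon. Any quality trimming would reduce that margin.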

For some of the more recent studies, we do have Read 2 data, and it should be available in Qiita (qiita.ucsd.edu) for some of them. We will try to come up with a list of which studies have these data available. Note that it will only be possible to merge sequences for studies with ~150 bp reads; read length is noted in the Release 1 mapping files, and studies sequenced since 2015 that are in Qiita will also have reads of ~150 bp. We know that people are interested in the Read 2 data, and we will try to make this more accessible in the future!

Thanks! Luke

Cc: @antgonza @ackermag @walterst

peterjc commented 4 years ago

I assume "fastaq" was a typo for FASTQ.

Could you give one or two specific examples with paired end data (R1 and R2) suitable for overlap merging? Thanks!

cuttlefishh commented 4 years ago

These studies appear to have R1 and R2 data (plus index) and should be suitable for merging:

https://qiita.ucsd.edu/emp/study/description/10561
https://qiita.ucsd.edu/emp/study/description/10533

The studies numbered >10000 were sequenced after the start of 2015 and should have longer reads (150bp). If the reverse reads are present, it should be possible to merge forward and reverse reads.
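
For anyone unsure what merging forward and reverse reads involves, the following is a deliberately naive sketch of overlap merging. It is not the method used by EMP or Qiita, and it ignores quality scores, which dedicated read-merging tools take into account. The function names, the `min_overlap` and `max_mismatch_frac` parameters, and the synthetic amplicon are all assumptions made for illustration.

```python
import random

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(r1: str, r2: str, min_overlap: int = 20,
               max_mismatch_frac: float = 0.1):
    """Merge a forward read and a reverse read, or return None.

    R2 is reverse-complemented so that it reads in the same direction
    as R1. Candidate overlaps are tried from longest to shortest, and
    the first one with few enough mismatches is accepted. Mismatches
    inside the overlap are resolved in favour of R1 (hypothetical
    policy; real tools use base qualities instead).
    """
    r2_rc = reverse_complement(r2)
    for ov in range(min(len(r1), len(r2_rc)), min_overlap - 1, -1):
        tail, head = r1[-ov:], r2_rc[:ov]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * ov:
            return r1 + r2_rc[ov:]
    return None  # no acceptable overlap found


# Synthetic example: a random 253 bp "amplicon" read from both ends.
rng = random.Random(0)
amplicon = "".join(rng.choice("ACGT") for _ in range(253))
r1 = amplicon[:150]
r2 = reverse_complement(amplicon[-150:])

merged = merge_pair(r1, r2)
print("merged length:", None if merged is None else len(merged))
print("recovers amplicon:", merged == amplicon)
```

The same function returns `None` when the reads are too short to overlap, which is the situation for the older, shorter-read studies.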