czbiohub-sf / tabula-muris

Code and annotations for the Tabula Muris single-cell transcriptomic dataset.
https://www.nature.com/articles/s41586-018-0590-4
BSD 3-Clause "New" or "Revised" License

Where can I find the raw unprocessed reads? #223

Closed FalkoHof closed 3 years ago

FalkoHof commented 4 years ago

Hey, is there a possibility to download the tabula muris data as unaligned, not adapter trimmed data from AWS, or is this only possible on SRA?

I checked the AWS bucket, but both the '10x_bam_files' and 'facs_bam_files' folders seem to contain only BAM files with the *.Aligned.out.sorted.bam suffix. In that case I would assume that, even if unaligned reads were kept, all reads have been preprocessed/trimmed?
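For reference, I browsed the bucket roughly like this (I'm assuming the bucket name here; substitute the one from the actual data release):

```bash
# List the two BAM folders; --no-sign-request works because the bucket
# is public. The bucket name below is an assumption.
aws s3 ls --no-sign-request s3://czb-tabula-muris/10x_bam_files/
aws s3 ls --no-sign-request s3://czb-tabula-muris/facs_bam_files/
```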

Thanks! Falko

aopisco commented 4 years ago

@jamestwebber might be able to help

jamestwebber commented 4 years ago

The data on AWS is not adapter trimmed; the insert size for these reads is large enough that trimming wasn't necessary.

FalkoHof commented 4 years ago

Thanks for the response! I also assume discordant reads were kept, so that I could get all the original reads from the bam files?

jamestwebber commented 4 years ago

I believe so, but I haven't verified this by doing the round trip back to a fastq file and checking. If you give it a try, we could compare read counts with the original fastq files.

We discussed with the AWS public data folks and decided to upload only BAM files because we thought that would cover everyone's needs. But if they are missing reads (maybe chimeric reads or something) that might need to be changed.
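If you do try it, something like the following should work (an untested sketch; the filename is just illustrative of the bucket's naming pattern):

```bash
# Name-sort so mates are adjacent, then write paired fastq files.
# -0/-s divert reads without proper pair flags and singletons;
# -n keeps read names as-is (no /1 and /2 suffixes).
samtools sort -n A1-MAA000400-3_8_M-1-1.Aligned.out.sorted.bam \
  | samtools fastq -1 reads_R1.fastq -2 reads_R2.fastq \
      -0 /dev/null -s /dev/null -n -
```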

FalkoHof commented 4 years ago

Thanks for the swift reply! I will download a few and report back within a week or so at the latest. Best, Falko

FalkoHof commented 4 years ago

We checked three files:

- SRR6571079 / A1-MAA000400-3_8_M-1-1
- SRR6571474 / A10-MAA000586-3_8_M-1-1
- SRR6571475 / A11-MAA000586-3_8_M-1-1

See below. The FASTQ column contains the number of reads reported as "spots read" by `fasterq-dump [SRR]`; the BAM column shows the number of unique read pairs from the AWS bam files as reported by `samtools fastq [BAM]`.

Total read pairs:

| SRR | FASTQ | BAM |
| --- | ---: | ---: |
| SRR6571079 | 3288555 | 3048963 |
| SRR6571474 | 1284467 | 1175468 |
| SRR6571475 | 2070727 | 1903004 |

So it seems that the bam files do not contain all reads?
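For reference, the numbers were produced roughly like this (a sketch; the exact options may have differed):

```bash
# FASTQ column: fasterq-dump reports "spots read : N" when it finishes.
fasterq-dump --split-files SRR6571079

# BAM column: count read pairs in the fastq regenerated from the AWS bam
# (see the samtools fastq sketch above); fastq records are 4 lines each.
echo $(( $(wc -l < reads_R1.fastq) / 4 ))
```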

jamestwebber commented 4 years ago

Hm, you must be right. The BAM files on AWS are the output from STAR, and my best guess is that it split out reads it identified as chimeric or as spanning splice junctions. That's unfortunate; it would have been nice to have the raw files in that resource rather than only in SRA.
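One way to check that guess (a sketch; the SAM flag values are standard, the local path is assumed):

```bash
# Break down what one of the AWS BAMs actually contains, to see which
# read categories could account for the missing pairs.
BAM=A1-MAA000400-3_8_M-1-1.Aligned.out.sorted.bam   # assumed local copy
samtools flagstat "$BAM"                 # overall summary
samtools view -c -F 0x900 "$BAM"         # primary alignments only
samtools view -c -f 0x100 "$BAM"         # secondary (multi-mapping) records
samtools view -c -f 0x800 "$BAM"         # supplementary (chimeric) records
samtools view -c -f 0x4 "$BAM"           # unmapped reads retained, if any
```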

FalkoHof commented 4 years ago

Do you have any plans to upload the raw files to AWS? Having either fastq or unaligned bam files as well would be awesome!

jamestwebber commented 4 years ago

We ended up uploading the BAM files after discussion with the AWS public data team, but we didn't realize we'd be missing out on a small number of reads that are potentially interesting. Given that AWS is hosting the data for us I don't think there's a plan to add the fastq files as well, but they should be available from SRA if you want them.

cjmielke commented 4 years ago

I'm using the 10x "bamtofastq" tool to attempt recovery of the original fastq reads from the AWS bam files. The tool runs, but the barcodes are found in the R1 "reads" files, and those only contain 26 nt per read.

This is my first time playing with this type of data, so I may be out of bounds here, but I'm guessing these BAMs aren't useful for recovering the raw reads with 10x's tool?

jamestwebber commented 4 years ago

For the 10x data, I believe the bamtofastq tool should produce a 26bp R1 file containing the barcode and UMI, an 8bp file of indexes, and an R2 file which contains the actual mRNA read. Is that what you're getting?

That tool isn't going to work on the SmartSeq2 BAM files (I'm not sure what it will produce).
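For the 10x BAMs, something like this should show the layout (an untested sketch; the paths and the output file naming are assumptions):

```bash
# Regenerate the original 10x fastqs from a Cell Ranger-produced BAM,
# then spot-check the read lengths described above.
bamtofastq possorted_genome_bam.bam ./fastq_out     # paths assumed

# R1 should be 26 bp (16 bp cell barcode + 10 bp UMI for v2 chemistry),
# I1 8 bp (sample index), and R2 the actual cDNA read.
zcat ./fastq_out/*/*_R1_001.fastq.gz | head -2 | tail -1 | awk '{print length($0)}'
zcat ./fastq_out/*/*_R2_001.fastq.gz | head -2 | tail -1 | awk '{print length($0)}'
```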

cjmielke commented 4 years ago

You're right! I stand corrected. I just found those R2 files and came back here to delete my comment, but you beat me to it :p Thank you!