jaclyn-taroni opened this issue 6 years ago
I would rather ask upstream whether they can implement a flag that randomizes the reads first, at a small compute cost.
It seems a reasonable assumption that this should almost never occur in the wild. I am specifically couching my language here because, as we all know, if something can happen in genomics, it will. However, the raw reads are expected to come off the sequencer in a sufficiently random order. The only real reason you might see ordered reads (that I can easily think of) is if someone "back-converts" a sorted BAM to FASTQ files or some such. Hopefully, this is infrequent enough to ignore, at least in a first pass.
Alright cool, my suspicion was “almost never” :+1: I will close this.
FYI, as it stands now, for SRA: one factor is that SRA stores mate-pairs together, so there can be some reordering to make that happen. Also, some of the tools for reading SRA have options that can affect the output order.
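To make that concrete (my example, not @durbrow's), the spot-splitting options on `fastq-dump` change how the reads from a given spot land in the output:

```sh
# Write mate 1 and mate 2 to separate _1/_2 files, in matching order:
fastq-dump --split-files SRR5054227

# Split each spot into its individual reads within a single output file:
fastq-dump --split-spot SRR5054227
```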
There is a general need in bioinformatics for a tool to sort reads in random order. Unfortunately, neither FASTQ nor SAM is amenable to simply piping into `sort -R`.
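One common shell idiom (my addition, not something proposed in this thread) sidesteps this for plain single-end FASTQ by flattening each four-line record onto one line, so that line-oriented tools like `shuf` apply. A minimal sketch, assuming an uncompressed, unwrapped `reads.fastq`:

```sh
# Flatten each 4-line FASTQ record onto one tab-separated line,
# shuffle whole records, then restore the original 4-line layout.
paste - - - - < reads.fastq | shuf | tr '\t' '\n' > reads.shuffled.fastq
```

This is also why `sort -R` alone fails: it would shuffle individual lines and tear the four-line records apart.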
@durbrow : do you store (and is it made accessible) whether the submission was an aligned one or not?
@durbrow : also, do you know what fraction of RNA-seq sample submissions are aligned vs. unaligned as submitted?
@cgreene wrt RNA-seq, I'm sorry, I don't know that. I deal mostly with the technical side of things (I am one of the SRA devs). From the technical side, there is very little special that is done for RNA-seq in SRA. For example, we don't record in the file whether or not it is RNA-seq. That information is probably available from the metadata for the sample. I can put you in touch with our data curators? They may know, or be able to run a query to get the answer.
@durbrow : if there are folks there who could let us know how frequently this occurs and/or if there is a way to know that it has occurred so that we can run something slightly different in those cases, it would be wonderful to be put in touch with them. Thank you!
@cgreene wrt aligned vs. unaligned: yes, we do store that, and we have tools that provide the information. Our `vdb-dump` tool can show you what the underlying database is, the tables it contains, the columns, data types, etc., as well as the data. In particular, all aligned SRA objects have a `PRIMARY_ALIGNMENT` table; you can probably guess what it contains. No unaligned SRA objects contain that table. For example:
```sh
vdb-dump --table_enum SRR5054227
```
would give you a list of the tables in that run; one of them is `PRIMARY_ALIGNMENT`.
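So a script can detect aligned runs by looking for that table. A minimal sketch (my own, not an official SRA recipe):

```sh
#!/bin/sh
# Usage: check_aligned.sh ACCESSION
# Reports whether the run contains a PRIMARY_ALIGNMENT table.
acc="$1"
if vdb-dump --table_enum "$acc" | grep -q 'PRIMARY_ALIGNMENT'; then
    echo "$acc: aligned submission"
else
    echo "$acc: unaligned submission"
fi
```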
We generally don't talk about these details. Our tools try to present a uniform interface to the data so that our users don't have to worry about it. So you can get FASTQ from an aligned submission, or SAM from an unaligned one.
@cgreene You should be able to contact the data curation team here: sra@ncbi.nlm.nih.gov
BTW, in my opinion, this is the best way to get info about SRA objects. If you have a run accession, you can navigate the SRA hierarchy, find sample and project metadata, find other submissions associated with it, the articles they were published in, etc.
https://www.ncbi.nlm.nih.gov/sra/?term=SRR5054227
Change `term=X` as needed for different accessions.
I use a similar (but way less sophisticated) workflow that feeds SRA-derived FASTQ into Salmon for quantification. I've been trying to ignore the read order issue, but @durbrow's response makes it harder to do that, especially with this HowTo wiki encouraging sorted BAMs. (I have never submitted samples to SRA, so I haven't actually used that workflow.)
FWIW, this small-scale test I ran on the 39 tumor:adjacent-normal pairs in the TCGA COAD project (78 samples, each quantified under sorted and shuffled conditions, for a total of what should be 156 marks on the PCA) makes me think the impact of inadvertently using sorted input is not dramatic:
For TCGA single-end samples, I shuffled the reads with this one-liner from Aaron Quinlan et al.:
```sh
# Flatten, shuffle, restore; -F '\t' on the second awk keeps header
# lines that contain spaces from being split across fields.
awk -v OFS='\t' '{getline seq; getline sep; getline qual; print $0,seq,sep,qual}' ${name}.fastq \
  | shuf \
  | awk -F '\t' -v OFS='\n' '{print $1,$2,$3,$4}' > ${name}.sh.fastq
```
I used this Perl script from the chloroExtractorTeam to shuffle the paired-end reads.
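For anyone who wants a shell-only alternative to the Perl script, here is a minimal sketch of the same flatten-and-shuffle idea for paired-end reads (my own sketch, untested on the TCGA data; it assumes bash, plus uncompressed `r1.fastq`/`r2.fastq` with mates in identical order):

```sh
# Flatten each mate's 4-line record, join the mates side by side so they
# shuffle together, then split the shuffled pairs back into two files.
paste <(paste - - - - < r1.fastq) <(paste - - - - < r2.fastq) \
  | shuf \
  | awk -F '\t' '{print $1"\n"$2"\n"$3"\n"$4 > "r1.shuffled.fastq";
                  print $5"\n"$6"\n"$7"\n"$8 > "r2.shuffled.fastq"}'
```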
I haven't implemented shuffling yet for SRA, but I'm starting to think I should. Judging from your repo, I think you will probably come up with something better than I will.
@dampierch : I'm curious - if you shuffled the reads twice, what does the run-to-run variability look like? Does the sorted version look just like an arbitrary ordering of reads, or is there something systematic going on there that makes the difference larger?
@cgreene: I shuffled each tissue sample's set of reads one time, but the paired nature of my dataset (tumor and tumor-adjacent "normal" tissue) resulted in two sets of reads for each subject; hence all the data points on my plot despite just 39 pairs in my comparison. I think I make this more confusing every time I write it. (My PI is looking forward to the end of this project.)
I am probably misunderstanding your question, but my (i.e. TCGA's) sorted reads are coordinate-sorted. I do not understand the EM algorithm enough to judge whether a potential bias on parameter estimation arising from processing a series of reads from (the beginning of) chromosome 1, then from chromosome 2, etc. is likely to have an effect on abundance estimates dramatic enough to have meaningful biological consequence. I'm skeptical, but as I'm ignorant, I'm also fearful. Do you have a view on this (i.e. parameter estimation and biological interpretation), @rob-p?
@rob-p, sorry for bringing this up again. I see you addressed it in 2015, 2016 (indirectly), and 2017. If you include @jaclyn-taroni's original opening of the present issue, you have addressed the problem every year for the past four. You now have the opportunity to keep your streak alive.
@cgreene, in case you have not already seen this sort-of-related old tweet, I bring it to your attention as a counterpoint to my PCA plot. I think FPKMs are not the right unit, and my relative expression summary is more relevant than an absolute value comparison, not to mention they were using eXpress, but I'm still impressed with the divergence in estimates driven by read order (if that's really what was going on). I'd like to do a more formal comparison of coordinate-sorted vs. shuffled input.
Hi @dampierch, you certainly make a good argument for me to keep my streak alive :). There are, I think, a few points to address. The first is restricted purely to "alignment-based" salmon (where the quantification is done directly from the aligned BAM file), while the second is the effect of observing the reads (or the corresponding alignments) in a given order. So, I'll address the potential issues in decreasing order of severity (i.e., how much they would bork the analysis). In alignment-based mode, salmon expects all of the alignment records for a given read (or read pair) to appear contiguously in the BAM file, like so:
```
<alignment of 1 (left end) to A>
<alignment of 1 (right end) to A>
<alignment of 1 (left end) to B>
<alignment of 1 (right end) to B>
<alignment of 1 (left end) to C>
<alignment of 1 (right end) to C>
```
If the alignments are not properly grouped like this, salmon is likely to write a bunch of verbose warnings (i.e., both to stderr and to the log file) complaining about suspicious pairs.
Now, the good news. Once all reads have been observed, salmon begins the offline phase, where it uses the (range-factorized) equivalence classes to iteratively re-estimate transcript abundance until those estimates converge. During the offline phase, the order in which the reads were observed has no specific effect, because what is being processed is a reduced representation of the mapping information. Read order affects this phase in only two ways: (1) the initialization of the offline VBEM algorithm is given by a combination of the online estimates of transcript abundance and the uniform (i.e., uninformative) abundance vector, and (2) the weights in the (range-factorized) equivalence classes are estimated under the auxiliary parameters learned during the online phase.

However, the fact that this offline phase is always run, typically for hundreds or thousands of iterations until convergence, means that the magnitude of the effect of mis-estimating transcript abundances by observing the reads in order during the online phase is greatly mitigated. That means that different read orders don't tend to lead to the total failures of estimation you see in the tweet above (though, in that case, it's not clear that salmon would be happy anyway, because the alignment records in the coordinate-sorted BAM are not contiguous and the pairs are not adjacent). However, the exact magnitude of the effect is hard to analyze theoretically, and depends on the extent to which the auxiliary models are affected as well.
Thank you, @rob-p, for your characteristically thorough and helpful response. May this constitute your last exposition on the read order issue for 2019.
Another year, another question about this: would it be sufficient to take my STAR-aligned BAMs and run them through `samtools collate`?
```sh
samtools collate possibly_sorted_by_alignment.bam grouped_by_read_name
```

which should save `grouped_by_read_name.bam`.
Hi @alanhoyle, I think the `samtools collate` approach should work, although I haven't tried it. One issue that could give you trouble is that the `QNAME` field may not always be available in the BAM file of interest. SRA often stores read names but does not guarantee read-name retention. See this issue comment for a brief introduction to the read name problem in SRA.
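For what it's worth, putting the two pieces together would look something like the sketch below (my own untested sketch; `transcripts.fa` and the output directory are placeholders, and `-l A` asks salmon to infer the library type):

```sh
# Group alignments by read name so mates and multi-mapping records are
# adjacent, then quantify directly from the name-grouped BAM.
samtools collate possibly_sorted_by_alignment.bam grouped_by_read_name
salmon quant -t transcripts.fa -l A -a grouped_by_read_name.bam -o salmon_quant
```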
**Context**

Salmon assumes that the reads you give it are in a random order -- see the first Note here in the Salmon docs.

**Problem or idea**

Because we won't know whether the `fastq.gz` files from SRA have been ordered in some way without checking, we might want to shuffle the reads before giving them to `salmon quant`. It is not clear to me how frequently these files will be in a non-random order (if ever).

**Solution or next step**

Based on my understanding of #157 (and by extension, how things are currently implemented), we have to unzip the `fastq.gz` file to determine the read lengths. That might be the time to shuffle the reads, but this needs a dev team perspective. Also, as noted above, I don't know how many samples this will affect.
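If shuffling does get attached to the decompression step, one possibility (my sketch, reusing the flatten-and-shuffle idiom from earlier in the thread; the file names are placeholders, and `shuf` holds all records in memory) is a single streaming pipeline:

```sh
# Decompress, flatten each 4-line record, shuffle records,
# restore the layout, and recompress.
zcat sample.fastq.gz | paste - - - - | shuf | tr '\t' '\n' | gzip > sample.shuffled.fastq.gz
```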