getzlab / rnaseqc

Fast, efficient RNA-Seq metrics for quality control and process optimization
Other
146 stars 19 forks source link

Read Length fix #78

Closed kachulis closed 1 year ago

kachulis commented 1 year ago

Max read length is tested against alignment span (end - start), so can end up being incorrect. This PR fixes this, but testing against true read length.

agraubert commented 1 year ago

alignmentSize is really only used for checking an edge case in legacy mode. You may notice that on line 277 we use alignmentSize rather than alignment.Length() for overriding the readLength, which I also admit is strange, but I can't convince myself that it changes the behavior in any meaningful way (other than perhaps overwriting readLength more often than is necessary). readLength itself is then only used at the end where it is reported directly.

kachulis commented 1 year ago

@agraubert I think that is correct, IF the data being used has a constant read length. If that is not the case, due to trimming, or a technology which produces variable length reads, then the readLength metric will end up being somewhat randomly chosen from the distribution of observed read lengths.

I guess the question is what a good behavior would be in these variable read length cases. This PR is more intended to address the trimming case, with the thought being that probably the length of the longest observed read is the most meaningful value for "read length", with all shorter reads having been trimmed down from there. In the case where the sequencing tech itself produces variable read length, this still doesn't quite work, but I'm not sure there's really a good option for representing read length with a single metric in that case anyway.

agraubert commented 1 year ago

I see your point. It's intended to be the maximum observed read length, but as currently implemented, I suppose it's the read length of the maximum spanning read. We never ran into problems during testing because this was designed against fixed-length 150bp illumina reads. I'll patch that line when I get a chance