SystemsGenetics / GEMmaker

A workflow for construction of Gene Expression count Matrices (GEMs), useful for Differential Gene Expression (DGE) analysis and Gene Co-Expression Network (GCN) construction.
https://gemmaker.readthedocs.io/en/latest/
MIT License

StringTie prepDE.py script needs length argument for calculating raw counts #266

Open · JohnHadish opened this issue 1 year ago

JohnHadish commented 1 year ago

Description of the bug

This impacts both the Hisat2 and STAR aligners. The prepDE.py script is used to calculate raw counts. It has a parameter, -l (length), which should be set to the average length of each read; it defaults to 75 bp. Currently, GEMmaker does not set this parameter, so the default is always used. This results in incorrect raw read counts being reported to the user.

The code for prepDE.py:

http://ccb.jhu.edu/software/stringtie/dl/prepDE.py3

The code that needs to be changed:

https://github.com/SystemsGenetics/GEMmaker/blob/master/modules/local/stringtie_fpkm_tpm.nf

Suggested change: pass the read length as an argument to the StringTie module and use it when calling prepDE.py.
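A rough sketch of the idea, not GEMmaker's actual code: estimate the average read length from a FASTQ file and pass it to prepDE.py3 via its -l flag. The helper name, file names, and the -i sample-list argument are placeholders/assumptions here.

```python
#!/usr/bin/env python3
"""Hypothetical helper: estimate mean read length and forward it to prepDE.py3 -l."""

import gzip
import subprocess
import sys

def mean_read_length(fastq_path, max_reads=100_000):
    """Average the sequence length over up to `max_reads` reads in a FASTQ(.gz) file."""
    opener = gzip.open if fastq_path.endswith(".gz") else open
    total, count = 0, 0
    with opener(fastq_path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:  # sequence lines are every 4th line, offset 1
                total += len(line.strip())
                count += 1
                if count >= max_reads:
                    break
    return round(total / count) if count else 75  # fall back to prepDE.py's 75 bp default

if __name__ == "__main__":
    fastq, sample_list = sys.argv[1], sys.argv[2]
    length = mean_read_length(fastq)
    # -l overrides prepDE.py's 75 bp default; -i is assumed to point at the sample/GTF list
    subprocess.run(["prepDE.py3", "-i", sample_list, "-l", str(length)], check=True)
```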

Command used and terminal output

No response

Relevant files

No response

System information

No response

spficklin commented 1 year ago

This is a problem and should be fixed, but I don't think it means GEMmaker is delivering "incorrect" counts. Even if the average read length were given properly (rather than the 75 bp default), the count is still technically not exact. It's an estimate using the formula reads_per_transcript = coverage * transcript_len / read_len. For example, trimmed reads can still be shorter or longer than the average, so the count won't be perfect.
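To make that concrete, here is a small worked example of the formula above; the coverage, transcript length, and read lengths are made-up numbers:

```python
# Worked example of: reads_per_transcript = coverage * transcript_len / read_len
coverage, transcript_len = 20.0, 1500          # hypothetical values for one transcript

true_read_len = 150                            # actual average read length of the sample
default_read_len = 75                          # prepDE.py's default when -l is not set

true_count = coverage * transcript_len / true_read_len        # 200 reads
default_count = coverage * transcript_len / default_read_len  # 400 reads

# With the default, every transcript in this sample is inflated by the same
# constant factor (true_read_len / default_read_len = 2x).
print(true_count, default_count, default_count / true_count)
```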

I think the problem isn't that the counts are incorrect; it's more a challenge of comparing across samples that have different average read lengths.

If your input files are all from the same experiment, odds are the read lengths across the samples will be the same -- no problem.

If your input files are from samples that have different read lengths, then a cross-sample normalization should, I think, clear things up. Folks with larger datasets should be checking for this and normalizing.
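As a toy illustration of why a simple cross-sample normalization absorbs the read-length mismatch (the counts and the 2x factor below are invented for the example):

```python
import numpy as np

# Two samples with identical underlying expression, but sample B's counts are
# inflated 2x because prepDE.py was run with the 75 bp default instead of the
# sample's true 150 bp read length (the constant-factor effect shown above).
counts_a = np.array([100.0, 300.0, 600.0])
counts_b = counts_a * 2.0

def cpm(counts):
    """Counts-per-million: a simple library-size normalization."""
    return counts / counts.sum() * 1e6

# After normalization the per-sample constant factor cancels out,
# so the two samples are directly comparable again.
print(np.allclose(cpm(counts_a), cpm(counts_b)))   # True
```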

If your sample set is absolutely huge, then it's hard to normalize and there may be some comparison problems. But correlation analysis should still be okay (with some added noise).