WGLab / LongReadSum

MIT License

Add RNA-Seq TIN QC support #56

Closed jonperdomo closed 2 weeks ago

jonperdomo commented 1 month ago

Add transcript integrity number (TIN) values for RNA-Seq QC from BAM files, including unit tests.

jonperdomo commented 1 month ago

I tested with a GTEx RNA-seq file, GTEX-14BMU-0526-SM-5CA2F_rep.FAK93376.bam, and compared the results with RSeQC. RSeQC's tin.py exposes parameters for minimum coverage and sample size, so I implemented both to allow direct comparison and so that users can expect results that closely match RSeQC's. For transcripts, I downloaded the latest GENCODE v46 file of basic gene annotations for the GRCh38 reference chromosomes, gencode.v46.basic.annotation.bed, from https://www.gencodegenes.org/human/release_46.html
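
For readers unfamiliar with the metric, here is a minimal sketch of a TIN score computed from per-position coverage. It assumes the commonly described RSeQC definition, TIN = 100 * exp(H) / k, where H is the Shannon entropy of the coverage fractions over k sampled positions. The function name `tin_score` and its list input are hypothetical and chosen for illustration. This is not the code from LongReadSum or RSeQC, and it leaves out position sampling (sample size) and the minimum-coverage filter.

```python
import math


def tin_score(coverage):
    """Illustrative TIN from read coverage at k sampled transcript positions.

    Assumes TIN = 100 * exp(H) / k, with H the Shannon entropy (natural log)
    of the coverage fractions at positions with non-zero coverage.
    """
    k = len(coverage)
    if k == 0:
        return 0.0
    positive = [c for c in coverage if c > 0]
    total = sum(positive)
    if total == 0:
        return 0.0
    # Shannon entropy of the normalized coverage profile
    entropy = -sum((c / total) * math.log(c / total) for c in positive)
    return 100.0 * math.exp(entropy) / k
```

Under this definition, perfectly uniform coverage gives a score of 100, and coverage concentrated at a single position out of k gives 100 / k.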

I set minimum coverage to 2, and sample size to 100. RSeQC:

tin.py -i "${mod_bam}" -r "${bed_file}" -c 2 -n 100
Number of scores: 67069
Mean TIN: 67.089549182989
Median TIN: 74.25578864168884
Standard deviation of TIN: 26.001131242677577

LongReadSum:

longreadsum bam -i "${mod_bam}" -o "${output_dir}" -t 12 --genebed "${bed_file}" --min-coverage 2 --sample-size 100
Number of scores: 67069
Mean TIN: 67.0683
Median TIN: 74.25
Standard deviation of TIN: 26.0379

jonperdomo commented 1 month ago

This PR will also address the help text error from issue #57.

jonperdomo commented 1 month ago

Updated results with high precision.

TIN Results

RSeQC:

tin.py -i "${mod_bam}" -r "${bed_file}" -c 2 -n 100
Number of scores: 67069
Mean TIN: 67.089549182989
Median TIN: 74.25578864168884
Standard deviation of TIN: 26.001131242677577

LongReadSum:

longreadsum bam -i "${mod_bam}" -o "${output_dir}" -t 12 --genebed "${bed_file}" --min-coverage 2 --sample-size 100
Number of scores: 67069
Mean TIN: 67.06832655372376
Median TIN: 74.24996965188242
Standard deviation of TIN: 26.03788585287367
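
To make the agreement between the two tools easier to judge, the following sketch computes absolute and relative differences from the summary statistics reported above. The dictionary keys and the helper name `compare` are illustrative assumptions; the numbers are copied from the two outputs.

```python
# Summary statistics copied from the RSeQC and LongReadSum outputs above
rseqc = {
    "mean": 67.089549182989,
    "median": 74.25578864168884,
    "stdev": 26.001131242677577,
}
longreadsum = {
    "mean": 67.06832655372376,
    "median": 74.24996965188242,
    "stdev": 26.03788585287367,
}


def compare(reference, candidate):
    """Return {stat: (absolute difference, relative difference in percent)}."""
    result = {}
    for key, ref_value in reference.items():
        abs_diff = abs(candidate[key] - ref_value)
        result[key] = (abs_diff, 100.0 * abs_diff / ref_value)
    return result


differences = compare(rseqc, longreadsum)
for stat, (abs_diff, rel_diff) in differences.items():
    print(f"{stat}: abs diff = {abs_diff:.6f}, rel diff = {rel_diff:.4f}%")
```

All three statistics differ by less than 0.05 in absolute terms, and both tools report the same number of scores (67069).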

Performance comparison (--mem=50G, --cpus-per-task=8, --time=12:00:00) with seff:

RSeQC:

Nodes: 1
Cores per node: 8
CPU Utilized: 07:55:21
CPU Efficiency: 12.45% of 2-15:39:12 core-walltime
Job Wall-clock time: 07:57:24
Memory Utilized: 166.25 MB
Memory Efficiency: 0.32% of 50.00 GB

LongReadSum:

Nodes: 1
Cores per node: 8
CPU Utilized: 02:48:34
CPU Efficiency: 12.67% of 22:10:56 core-walltime
Job Wall-clock time: 02:46:22
Memory Utilized: 5.91 GB
Memory Efficiency: 11.83% of 50.00 GB
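
The seff figures above can be turned into a wall-clock speedup and a memory ratio with a short calculation. This is only a sketch: the helper `to_seconds` is hypothetical, and the memory conversion assumes 1 GB = 1024 MB.

```python
def to_seconds(duration):
    """Convert a Slurm-style duration ('HH:MM:SS' or 'D-HH:MM:SS') to seconds."""
    days = 0
    if "-" in duration:
        day_part, duration = duration.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in duration.split(":"))
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


# Job wall-clock times reported by seff above
rseqc_wall = to_seconds("07:57:24")
longreadsum_wall = to_seconds("02:46:22")
speedup = rseqc_wall / longreadsum_wall

# Memory utilized, in MB (assumes 1 GB = 1024 MB)
rseqc_mem_mb = 166.25
longreadsum_mem_mb = 5.91 * 1024
memory_ratio = longreadsum_mem_mb / rseqc_mem_mb

print(f"Wall-clock speedup: {speedup:.2f}x")
print(f"Memory ratio (LongReadSum / RSeQC): {memory_ratio:.1f}x")
```

On this single run, LongReadSum finished a little under three times faster in wall-clock time while using substantially more memory than RSeQC.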

jonperdomo commented 1 month ago

Add a unit test to complete this PR.

jonperdomo commented 2 weeks ago

This PR adds a new feature for calculating TIN scores. It writes the scores and their summary statistics in TSV format and adds the summary to the HTML report:

[Image: TIN summary in the HTML report]