BD2KGenomics / toil-rnaseq

UC Santa Cruz Computational Genomics Lab's Toil-based RNA-seq pipeline
Apache License 2.0

Should PAR locus on the Y chromosome be removed in RSEM pipeline? #172

Open wangshun1121 opened 5 years ago

wangshun1121 commented 5 years ago

Hello Dr Vivian:

I built a rough pipeline following your instructions, then compared my output against the RSEM gene expression results on Xena:

https://xenabrowser.net/datapages/?dataset=tcga_RSEM_gene_fpkm&host=https%3A%2F%2Ftoil.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443

Your table has 50 more gene IDs than mine. Comparing the two ID lists, I found that all 50 extra IDs begin with ENSGR, and that all of these genes lie in the PAR (pseudoautosomal region) loci on the Y chromosome.

Your results indicate that these PAR loci were not removed in the RSEM pipeline, as they were in the Kallisto pipeline. In my opinion, the PAR genes should be removed in the RSEM pipeline as well.
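
The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the code actually used in this thread: the function names `extra_gene_ids` and `split_by_prefix` are hypothetical, the two short ID lists stand in for the full Xena and local gene tables, and the `ENSG0...` IDs are assumed chrX counterparts of the listed `ENSGR...` IDs.

```python
def extra_gene_ids(reference_ids, own_ids):
    """Return the IDs present in reference_ids but missing from own_ids, sorted."""
    return sorted(set(reference_ids) - set(own_ids))


def split_by_prefix(gene_ids, prefix="ENSGR"):
    """Split gene IDs into those that start with `prefix` and those that do not."""
    matching = [g for g in gene_ids if g.startswith(prefix)]
    other = [g for g in gene_ids if not g.startswith(prefix)]
    return matching, other


# Toy stand-ins for the two gene ID columns (assumption: real tables are far larger).
xena_ids = [
    "ENSG00000182378.12",  # assumed chrX copy of PLCXD1
    "ENSGR0000182378.12",  # chrY PAR copy of PLCXD1
    "ENSG00000185960.12",  # assumed chrX copy of SHOX
    "ENSGR0000185960.12",  # chrY PAR copy of SHOX
]
my_ids = ["ENSG00000182378.12", "ENSG00000185960.12"]

extra = extra_gene_ids(xena_ids, my_ids)
par_y, unexplained = split_by_prefix(extra)
print(len(extra), "extra IDs;", len(par_y), "start with ENSGR")
```

If `unexplained` comes back non-empty, the difference between the two tables is not fully accounted for by the PAR entries.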

wangshun1121 commented 5 years ago

Here are the 50 PAR genes, taken from your gene table:

id gene chrom chromStart chromEnd strand
ENSGR0000228572.6 LL0YNC03-29C1.1 chrY 253743 255091 +
ENSGR0000182378.12 PLCXD1 chrY 276322 303356 +
ENSGR0000178605.12 GTPBP6 chrY 304529 318819 -
ENSGR0000226179.5 LINC00685 chrY 320990 321851 +
ENSGR0000167393.16 PPP2R3B chrY 333963 386955 -
ENSGR0000281849.2 RP13-465B17.4 chrY 386980 405579 +
ENSGR0000275287.4 Metazoa_SRP chrY 388100 388389 -
ENSGR0000280767.2 RP13-465B17.5 chrY 419157 421980 +
ENSGR0000234958.5 FABP5P13 chrY 523775 524102 -
ENSGR0000229232.5 KRT18P53 chrY 545236 545352 -
ENSGR0000185960.12 SHOX chrY 624344 659411 +
ENSGR0000237531.5 RP11-309M23.1 chrY 990221 994365 +
ENSGR0000225661.6 RPL14P5 chrY 1008503 1010101 -
ENSGR0000205755.10 CRLF2 chrY 1187549 1212750 -
ENSGR0000198223.14 CSF2RA chrY 1268800 1310381 +
ENSGR0000264510.5 BX649553.3 chrY 1291755 1291828 +
ENSGR0000264819.5 BX649553.4 chrY 1292094 1292167 +
ENSGR0000263980.5 BX649553.2 chrY 1293615 1293689 +
ENSGR0000265658.5 MIR3690 chrY 1293918 1293992 +
ENSGR0000263835.5 BX649553.1 chrY 1294132 1294206 +
ENSGR0000223274.5 RNA5SP498 chrY 1300256 1300375 -
ENSGR0000185291.10 IL3RA chrY 1336616 1382689 +
ENSGR0000169100.12 SLC25A6 chrY 1386152 1392724 -
ENSGR0000236871.6 LINC00106 chrY 1396427 1399402 +
ENSGR0000236017.7 ASMTL-AS1 chrY 1401769 1414028 +
ENSGR0000169093.14 ASMTL chrY 1403139 1453762 -
ENSGR0000182162.9 P2RY8 chrY 1462572 1537107 -
ENSGR0000197976.10 AKAP17A chrY 1591593 1602514 +
ENSGR0000196433.11 ASMT chrY 1615001 1643081 +
ENSGR0000223511.5 RP13-297E16.4 chrY 1732584 1755985 +
ENSGR0000234622.5 RP13-297E16.5 chrY 1767347 1768776 +
ENSGR0000169084.12 DHRSX chrY 2219516 2502805 -
ENSGR0000223571.5 DHRSX-IT1 chrY 2334295 2336410 -
ENSGR0000214717.9 ZBED1 chrY 2486414 2500967 -
ENSGR0000277120.4 MIR6089 chrY 2609191 2609254 +
ENSGR0000223773.6 CD99P1 chrY 2609348 2657229 +
ENSGR0000230542.5 LINC00102 chrY 2612988 2615347 -
ENSGR0000002586.17 CD99 chrY 2691179 2741309 +
ENSGR0000168939.10 SPRY3 chrY 56954332 56968979 +
ENSGR0000237801.5 AMD1P2 chrY 57015105 57016096 -
ENSGR0000237040.5 DPH3P2 chrY 57062156 57062405 +
ENSGR0000124333.14 VAMP7 chrY 57067813 57130289 +
ENSGR0000228410.5 TCEB1P24 chrY 57165512 57165845 -
ENSGR0000223484.6 TRPC6P chrY 57171890 57172769 -
ENSGR0000124334.16 IL9R chrY 57184101 57197337 +
ENSGR0000270726.5 AJ271736.10 chrY 57190738 57208756 +
ENSGR0000185203.11 WASIR1 chrY 57201143 57203357 -
ENSGR0000182484.14 WASH6P chrY 57207346 57212230 +
ENSGR0000276543.4 AJ271736.1 chrY 57209151 57209218 +
ENSGR0000227159.7 DDX11L16 chrY 57212184 57214397 -

wangshun1121 commented 5 years ago

My suggestion: pair each of these genes with its homologous gene on chrX, add its expression value to that of the chrX homolog, and then remove the chrY entry from the gene expression table.
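
The merge suggested above could look roughly like the following sketch. It rests on several assumptions that are not stated in this thread: that a chrY PAR ID maps to its chrX counterpart by replacing the `ENSGR` prefix with `ENSG0` (with the same version suffix), that expression values are on a linear scale (values that were log-transformed would have to be back-transformed before summing), and that one sample's expression is held in a plain dict. The names `par_y_to_x_id` and `fold_par_y_into_x` are hypothetical.

```python
PAR_Y_PREFIX = "ENSGR"
X_PREFIX = "ENSG0"  # assumption: "ENSGR" replaces "ENSG0" in the chrY PAR copy


def par_y_to_x_id(gene_id):
    """Map a chrY PAR gene ID to the ID of its assumed chrX homolog."""
    if not gene_id.startswith(PAR_Y_PREFIX):
        raise ValueError(f"not a chrY PAR gene ID: {gene_id}")
    return X_PREFIX + gene_id[len(PAR_Y_PREFIX):]


def fold_par_y_into_x(expression):
    """Add each chrY PAR value to its chrX homolog and drop the chrY entry.

    `expression` maps gene ID -> linear-scale expression for one sample.
    A new dict is returned; the input is left unchanged.
    """
    # Start from every entry that is not a chrY PAR copy.
    merged = {
        gene_id: value
        for gene_id, value in expression.items()
        if not gene_id.startswith(PAR_Y_PREFIX)
    }
    # Fold each chrY PAR value into the matching chrX entry.
    for gene_id, value in expression.items():
        if gene_id.startswith(PAR_Y_PREFIX):
            x_id = par_y_to_x_id(gene_id)
            merged[x_id] = merged.get(x_id, 0.0) + value
    return merged


# Toy sample with made-up values, for illustration only.
sample = {
    "ENSG00000182378.12": 3.0,   # assumed chrX copy of PLCXD1
    "ENSGR0000182378.12": 1.5,   # chrY PAR copy of PLCXD1
    "ENSG00000000003.14": 7.0,   # unrelated gene, left untouched
}
merged_sample = fold_par_y_into_x(sample)
print(merged_sample)
```

If a chrY PAR entry has no chrX counterpart in the table, this sketch creates the chrX entry rather than discarding the value; whether that is the right behavior depends on the annotation in use.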

jvivian commented 5 years ago

Dear @wangshun1121,

> My suggestion: pair each of these genes with its homologous gene on chrX, add its expression value to that of the chrX homolog, and then remove the chrY entry from the gene expression table.

Thank you very much for this write-up and the accompanying table. Because the workflow must produce deterministic expression results that match the values on Xena, I have to freeze the default workflow inputs. However, I've wanted to start hosting new sets of inputs as new Gencode annotations are released, and the issues you've raised are important points to consider.

I'll keep this issue open until I've properly addressed it. Please let me know if you have any other suggestions or improvements I can make; they're greatly appreciated.