churchill-lab / emase-zero

C++ Implementation of EMASE
http://churchill-lab.github.io/emase-zero/
MIT License

ERROR LOADING TRANSCRIPT LENGTH FILE ; UNKNOWN HAPLOTYPE NAME: PAR #6

Open jon4thin opened 1 month ago

jon4thin commented 1 month ago

Hello! I am running into an issue with emase-zero as packaged in kbchoi/emase:latest. It fails with the error: ERROR LOADING TRANSCRIPT LENGTH FILE ; UNKNOWN HAPLOTYPE NAME: PAR

If I remove the PAR-labelled transcripts from the transcript length file, I get this warning instead:

2024-08-15T19:22:18.685284122Z WARNING!
2024-08-15T19:22:18.685304588Z LOADING TRANSCRIPT LENGTH FILE /sbgenomics/Projects/daa73588-c56b-406b-95d0-845b472104dc/_1_Diploid_noHLA_1-01553_Asterisks.txlengths_noPAR.txt
2024-08-15T19:22:18.685308557Z EXPECTED 473950 VALUES BUT FILE CONTAINS 473632
2024-08-15T19:22:19.034095545Z No mapping information for ENST00000431238.7_PAR_Y

The "No mapping information for ..." line is printed 162 times, once for each ENST in the .EC file that carries such a label. After that an empty line is printed and the command fails with exit code 1.

Any insights?

jon4thin commented 1 month ago

Jury-rigged workaround (still untested; I will update afterwards): I can go into the .bam and grep out both the PAR_Y ENSTs and their original ENSTs, so that when I use samtools faidx to get the .fai file I take the transcript lengths from, neither file contains these transcripts. Then, when I make the gene mapping files with biomaRt in R, I can filter out all the transcripts that are not present in the transcript length file.
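For anyone trying the same workaround, here is a minimal Python sketch of the filtering step on the .fai side. It is untested against emase-zero. It relies on the standard samtools .fai layout (tab-separated, sequence name in column 1, length in column 2). It assumes PAR copies are named with a "_PAR_Y" suffix, as in ENST00000431238.7_PAR_Y from the log above, and that any haplotype suffix is appended after an underscore. The function names are hypothetical, and the layout emase-zero expects for the transcript length file is not covered here.

```python
def par_y_base_ids(names):
    """Collect the base transcript IDs of every name that carries a _PAR_Y label.

    Assumption: PAR copies look like '<ENST id>_PAR_Y', optionally followed
    by further suffixes.
    """
    return {name.split("_PAR_Y")[0] for name in names if "_PAR_Y" in name}


def filter_fai_lines(fai_lines):
    """Return (name, length) pairs from .fai lines, dropping PAR_Y transcripts
    and the original transcripts they were copied from.
    """
    # A .fai line is: NAME <tab> LENGTH <tab> OFFSET <tab> LINEBASES <tab> LINEWIDTH
    rows = [line.rstrip("\n").split("\t") for line in fai_lines if line.strip()]
    bases = par_y_base_ids(row[0] for row in rows)

    kept = []
    for row in rows:
        name, length = row[0], int(row[1])
        # Drop the bare original ID and any suffixed variant of it
        # (the _PAR_Y copy, or a haplotype-suffixed name; the latter is an assumption).
        if any(name == b or name.startswith(b + "_") for b in bases):
            continue
        kept.append((name, length))
    return kept
```

The list returned by filter_fai_lines(open(path)) can then be written out in whatever column layout the transcript length file needs. The same set of kept names could serve as the filter for the biomaRt gene mapping step.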