alexchwong / SpliceWiz

SpliceWiz is an R package for exploring differential alternative splicing events in splice-aware alignment BAM files.
Other
13 stars 8 forks source link

Process intron over multiple chromosomes in .grlGaps when using BuildRef #48

Closed SWCode132 closed 1 year ago

SWCode132 commented 1 year ago

Hi!

Thank you for creating SpliceWiz, it seems very powerful. However, I was getting an error with my intended usage of BuildRef and traced it through the source code to your function .grlGaps. Specifically, transcripts from the pseudoautosomal regions of the X and Y chromosomes are found on both the X and Y chromosomes, e.g. ENST00000244174.11 encoding IL9R.

When one performs unlist on a GRanges List object, this will lead to multiple range lines being created for one item in the list: one for the range found on each chromosome (since they correspond to different seqLevels, they can't be merged into a single line). When the program then runs psetdiff, an error is thrown (Error in .local(x, y, ...) : 'x' and 'y' must have the same length). This makes complete sense, because the original object grl has a different number of transcripts than the number of ranges in unlist(range(grl)). I am not sure how you would want to work around this best, but it's not of a lot of interest to me so for now I'm just going to delete the duplicate Y-chromosome transcript annotations.

All the best!

SWCode132 commented 1 year ago

The workaround described above worked (that is, I found gene_id values duplicated on the Y and X chromosome, and deleted the Y chromosome versions of all annotations associated with the duplicated gene_id values). It would occur to any user using standard Gencode annotations. I hope it can be fixed for future users, or at least this thread remains active for people to see the workaround. Thanks!

alexchwong commented 1 year ago

Thanks for the bug report. I wasn't aware that a gene could reside on more than 1 chromosome.

@SWCode132 Could you suggest links to an example Gencode GTF for which this problem occurs so I can investigate further? I did not previously have problems with Gencode annotations but have not recently tried to verify this.

SWCode132 commented 1 year ago

Hi Alex,

I had the issue with Gencode v43 gff3 (which I converted to gtf but the problem is in the original gff), but is dealt with differently in the v44 version. Bizarrely, the problem is replicable if you download the gff3 from Gencode V43 today (https://www.gencodegenes.org/human/release_43lift37.html and e.g. search for ENST00000244174.11 or I think it's ENST00000244174.11_6) but not if you download the gtf, where the transcript ID is appended to indicate it's the Y chromosome version. The X and Y chromosome versions are given different IDs but assigned the same transcript ID and gene ID.

The problem should be unique to these chromosomes, as they are different chromosomes with a homologous region (in all other cases, the homologous chromosome is interchangeable with the original chromosome). It's surprising that the issue is present only in gff3 but not gtf files from Gencode V43 website... strange. In V44 it seems that an entirely different transcript ID number is assigned to the Y chromosome version, so it wouldn't be a problem. So I don't know if it's worth fixing or just noting for SpliceWiz, as it will only happen to someone who started downloading the V43 gff3.

alexchwong commented 1 year ago

Thanks for investigating @SWCode132 ! Seems like the problem lies with the Gencode annotations. Good to know that such errors can arise. At some stage I will implement some sanity checks of input GTF files upstream to the reference building

alexchwong commented 1 year ago

Version 1.3.4 should fix this issue by calling an informative error if multiple seqnames per gene_id or transcript_id are found. Should be available via Bioc 3.18 in coning days. Closing this issue now.