Xinglab / espresso

Ensembl b38 HLA error. #22

Open sridhar0605 opened 1 year ago

sridhar0605 commented 1 year ago

Hi @EricKutschera,

I'm using an Ensembl GTF and FASTA that include HLA contigs, and I see the error below at the ESPRESSO_S.pl step:

[Thu May 18 21:40:23 2023] Summarizing annotated splice junctions for each read group
HLA-DRB1*03:01:01:02 ne HLA-DRB1*03: HLA-DRB1*03:01:01:02:12301:13089 at /bin/espresso/src/ESPRESSO_S.pl line 462.

Any thoughts? FWIW test data in the repo works fine.

Thank you. Sid

sridhar0605 commented 1 year ago

For anyone who runs into this issue: I can confirm that removing the HLA contigs solved it.

grep -v 'HLA-' input.sam > input_filtered.sam
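Note that the grep above drops every line containing HLA- anywhere, including @SQ header lines and any read whose name or tags happen to contain that text. If a stricter filter is wanted, here is a minimal Python sketch that checks only the reference name (SAM column 3) and the SN: tag of @SQ header lines. The function name keep_sam_line and the sample records (including the LN values) are made up for illustration, and the sketch does not look at the mate reference column (RNEXT).

```python
def keep_sam_line(line: str, prefix: str = "HLA-") -> bool:
    """Return True if a SAM line should be kept (hypothetical helper)."""
    if line.startswith("@"):
        # Header line: drop only @SQ entries whose SN: tag names a matching contig.
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:") and field[3:].startswith(prefix):
                    return False
        return True
    fields = line.split("\t")
    # Alignment line: column 3 (index 2) is RNAME, the contig the read aligned to.
    return not (len(fields) > 2 and fields[2].startswith(prefix))


# Made-up records for illustration only.
lines = [
    "@HD\tVN:1.6\tSO:coordinate\n",
    "@SQ\tSN:1\tLN:1000\n",
    "@SQ\tSN:HLA-DRB1*03:01:01:02\tLN:2000\n",
    "read1\t0\t1\t100\t60\t50M\t*\t0\t0\t*\t*\n",
    "read2\t0\tHLA-DRB1*03:01:01:02\t200\t60\t50M\t*\t0\t0\t*\t*\n",
    "HLA-named-read\t0\t1\t300\t60\t50M\t*\t0\t0\t*\t*\n",
]
kept = [line for line in lines if keep_sam_line(line)]
```

To apply it to a file, the same function can be used while streaming input.sam line by line and writing the kept lines to input_filtered.sam.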

EricKutschera commented 1 year ago

Here's the line for that error: https://github.com/Xinglab/espresso/blob/v1.3.2/src/ESPRESSO_S.pl#L462

ESPRESSO tries to keep some information in a string with : as a separator. Specifically, it gives splice junctions an ID of the form {chr}:{start}:{end}. Later it tries to parse that ID string, but that fails if the contig name itself contains a :.

In this case HLA-DRB1*03:01:01:02:12301:13089 is the splice junction ID and the parts are HLA-DRB1*03:01:01:02, 12301, and 13089. ESPRESSO ends up thinking the part up to the first : (HLA-DRB1*03) is the contig name.

Ideally ESPRESSO should be able to handle any contig name. I'll see if I can change this behavior.
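To make the failure mode concrete, here is a minimal sketch in Python (ESPRESSO itself is Perl, and none of this is taken from its source). The function names are hypothetical. It builds an ID in the {chr}:{start}:{end} form, shows how reading up to the first : yields the wrong contig, and shows one possible way around it: splitting from the right, since the last two fields are always the coordinates.

```python
def make_junction_id(contig: str, start: int, end: int) -> str:
    # Join the three parts with ':' as described in the thread.
    return f"{contig}:{start}:{end}"


def parse_first_colon(junction_id: str) -> str:
    # Naive parse: treat everything before the first ':' as the contig name.
    return junction_id.split(":", 1)[0]


def parse_from_right(junction_id: str) -> tuple[str, int, int]:
    # Split at most twice from the right, so any ':' inside the
    # contig name stays part of the contig.
    contig, start, end = junction_id.rsplit(":", 2)
    return contig, int(start), int(end)


junction_id = make_junction_id("HLA-DRB1*03:01:01:02", 12301, 13089)

print(junction_id)                     # HLA-DRB1*03:01:01:02:12301:13089
print(parse_first_colon(junction_id))  # HLA-DRB1*03
print(parse_from_right(junction_id))   # ('HLA-DRB1*03:01:01:02', 12301, 13089)
```

Splitting from the right only works because the start and end fields never contain a :. Another option would be a separator that cannot appear in contig names, or keeping the three parts in a structured record instead of a joined string.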

sridhar0605 commented 1 year ago

Thanks. My guess was that it might have something to do with string/regex expansion. I tried hacking the script with a Perl regex, but that failed.

Thanks for looking into this.

Sid