RitchieLabIGH / IRFinder

MIT License
13 stars 10 forks source link

Incorrect intron start/stop coordinates and partially incorrect static warning #18

Closed Waldehead closed 2 years ago

Waldehead commented 2 years ago

Dear Claudio,

i'm running IRFinder using the docker image 2.0.1. I stumpled across wrong start coordinates from the introns. The start coordinates are shifted by 1 base towards the 5' end.

Github_1

(The same gtf used for IR-detection was used in the IGV-Screenshot.)

Additionally some introns are labelled "clean" in the static warning altough overlapping with another feature (e.g. promoter like in the provided screenshot).

Github_2

gtf used: Mus_musculus.GRCm39.108.chr.gtf genome used: Mus_musculus.GRCm39.dna.primary_assembly.fa

I was able to work around those errors by correcting the positions and using chipseekers annotatePeak to annotate the detected IRs as "peaks" and using the column "annotation" to filter.

Additionally i used IRFinder on a human mRNA-dataset, same problems, so it's not a mouse-specific problem.

Otherwise IRFinder works like a charm.

Kind regards Max

CloXD commented 2 years ago

Hello Max, first of all, thanks for using IRFinder and I'm glad it works like a charm for you :)

For what concerns the shifted bases, you see this behaviour because IGV uses 1-indexed coordinate ( that means, the first nucleotide is couted a being in position 1 ), meanwhile IRFinder uses 0-indexed coordinate ( as mentioned in the Wiki ) . If you want to shift all the position, here's a quick awk code:

awk 'BEGIN {FS=OFS="\t"} NR == 1 {print } NR > 1 {$2=$2+1 ; $3 = $3 +1; print} ' ./IRFinder-IR-nondir.txt > ./IRFinder-IR-nondir-shifted.txt

For what concerns the "clean" attribution, I doubled check the original pearl script (IRFinder/bin/util/IntronExclusion.pl) for the details... It's still the same from the first version written by my colleague.

If the feature is on the same strand, the region is excluded from the reference for that intron. The "clean" means that, after removing the overlapping features on the same strand, more than 70% of the original intron has not been masked and its length is greater than 40, otherwise the intron is not considered. In case there is an overlapping feature on the opposite strand, it's marked as anti-near or anti-over ( depending on if is overlapping or is just in the neighbours ). In case an exon is covering the whole intron on the same strand, it's not masked but it's marked as known-exon.

I hope this clarified the situation and I apologize for the lack of a proper description. I'll add those few lines in the wiki. Cheers, Claudio

Waldehead commented 2 years ago

Thank you very much!

Waldehead commented 2 years ago

Hello Claudio, it's Max again.

Your awk code worked for me. At least at the 5' site. The 3' site is now annotated as part of an exon (with ChIPseeker, IGV also shows an overlap). image

It seems like only the 5' site is 0-indexed, so I just removed the +1 of the 3' site. After this, ChIPseeker annotates them as an intron. I'm mentioning this here in case someone stumbles upon this. (and if this is not intended and you want to fix this).

Cheers, Max

CloXD commented 2 years ago

Hi Max, yes, that was as intended. As mentioned in the Wiki:

0-indexed coordinate indicating the end of an intronic region. Please note, the range of an intronic region is half-open, e.g.[start,end), follows the same rule of a BED format.

Thanks for pointing this out anyways. Cheers Claudio