RitchieLabIGH / IRFinder

MIT License

IRFinder failing to build reference with UCSC format files #19

Open gabrieljaykay opened 2 years ago

gabrieljaykay commented 2 years ago

Hello,

I've been trying to build an IRFinder reference to look for intron retention in RNA-seq data stored in BAM files that use the UCSC format, so I need to build the reference from the UCSC genome sequence instead of Ensembl. I'm running BuildRefProcess with the FASTA and GTF files from UCSC placed in the same folder and named 'genome.fa' and 'transcripts.gtf', respectively. I keep getting the error shown in the attached screenshot (buildref_error), and I'm not sure why the build would fail on this specific chromosome. Any help you could provide would be greatly appreciated.

CloXD commented 2 years ago

Hello! Sorry for the inconvenience. Could you check whether chr10_KN196480v1_fix is present in the file STAR/chrNameLength.txt? If it's not, that might be the issue. chrNameLength.txt is created by STAR and is used in one of the reference build steps; it should list that sequence if it was in the genome.fa file. Is it possible that your genome.fa contains only the canonical chromosomes and not the fix patches? If so, you can either replace genome.fa with the full assembly or filter transcripts.gtf to keep only the canonical chromosomes. With awk, it should be something like:

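# Keep a backup of the original annotation
cp ./transcripts.gtf ./transcripts_bkup.gtf
# Keep only records on chr1..chrN, chrX and chrY (note: this pattern also drops chrM and any '#' header lines)
awk '$1 ~ /^chr[0-9XY]+$/ { print }' ./transcripts_bkup.gtf > ./transcripts.gtf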
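To see which sequence names are affected before filtering, something along these lines might help. It is only a sketch: the function name `missing_seqnames` is made up for illustration, and it assumes that chrNameLength.txt is tab-separated with the sequence name in the first column and that you run it from the reference folder.

```shell
# Hypothetical helper (not part of IRFinder): list the sequence names used in
# a GTF that do not appear in STAR's chrNameLength.txt.
# Assumes both files are tab-separated with the sequence name in column 1.
missing_seqnames() {
    gtf="$1"
    chrlen="$2"
    awk -F'\t' '
        FILENAME == ARGV[1] { known[$1] = 1; next }   # first file: names STAR indexed
        /^#/ { next }                                  # skip GTF header/comment lines
        !($1 in known) && !seen[$1]++ { print $1 }     # report each missing name once
    ' "$chrlen" "$gtf"
}

# Example call, assuming the reference folder is the working directory:
# missing_seqnames ./transcripts.gtf ./STAR/chrNameLength.txt
```

If this prints chr10_KN196480v1_fix (or other fix/alt names), the GTF refers to sequences that are not in the genome STAR indexed, which would be consistent with the mismatch described above.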

Let me know if this helped. Cheers, Claudio