SchulzLab / TEPIC

Annotation of genomic regions using transcription factor binding sites and epigenetic data
MIT License
40 stars 9 forks source link

TypeError in annotateTSS.py when using GTF #37

Open cardonaaaa opened 2 years ago

cardonaaaa commented 2 years ago

Hi, I'm trying to run TEPIC v2.2 with the following command:

./TEPIC.sh -g $ref/Ntab-BX_AWOK-SS.fa -b $outdir/14-5_out2_macs2_call_peak_peaks.narrowPeak -o $outdir/14-5_TEPIC_AWOK -p $psem_dir/Nicotiana_IDs_PSCM_JASPAR_2018.PSEM -a $ref/with_desc_Ntab-BX_AWOK-SS_Basma_rnaseq.gtf -c 60 -t

And getting this error:

Preprocessing region file: Removing chr prefix, sorting regions and removing duplicats
Runnig bedtools
Converting invalid characters
Starting TRAP
Traceback (most recent call last):
File "/mnt/lustre/scratch/nlsas/home/csic/bds/jjf/TEPIC-2.2/Code/annotateTSS.py", line 1051, in <module> main()
File "/mnt/lustre/scratch/nlsas/home/csic/bds/jjf/TEPIC-2.2/Code/annotateTSS.py", line 886, in main
tss,identifier=readGTF(args.gtf[0],transcriptAnnotation)   
File "/mnt/lustre/scratch/nlsas/home/csic/bds/jjf/TEPIC-2.2/Code/annotateTSS.py", line 44, in readGTF
if (int(s[4]) > tss[s[idIndex]][1]):
TypeError: '>' not supported between instances of 'int' and 'tuple'

I tried using two different GTFs (since I'm carrying out my analysis with two different reference genomes) and I'm getting the same error message with both of them. I'd like to know if it's an issue with my GTFs not being "gencode format" or something about the script. None of my files contain the "chr" prefix.

1st GTF: https://drive.google.com/file/d/1hDwjUDY33nUkN_Zpp6YiKaAhmLsjn9p_/view?usp=share_link 2nd GTF (the one used in the shown command): https://drive.google.com/file/d/1pAwSjREFfQ_r7CftXOhrIJcpAZrXh0-J/view?usp=share_link

Python version: 3.7.8 R version: 4.0.4

Thanks in advance!

Florian411 commented 2 years ago

Hey,

Thanks for trying TEPIC. The first thing I noticed is that your GTF files do not seem to contain a "gene" entry, which is the default the script is looking for. If you want to annotate based on transcripts, you have to run the pipeline with the -t parameter.

I also noticed that your "chromsome" identifiers are not numeric. This is something we have never used or tested. Does your narrow peak file contain the same identifiers?

Please let me know if you still run into problems once you use the -t.

Cheers, Florian

cardonaaaa commented 1 year ago

Hey @Florian411

I used the -t parameter, it's at the end of the command. Anyway I ran it again and the same error keeps popping. My narrow peak file contains the same identifiers.

So it may be due to the scaffold names not being numeric.

Thanks for answering so quickly!

Florian411 commented 1 year ago

Could you share with me a preview (maybe first 10 lines) of each of your files via gdrive? I can try to debug a bit further then.

cardonaaaa commented 1 year ago

Sure, the full GTF files are in the first comment. I leave the first 10 lines of each file below. Tell me if you have any problems downloading them.

Edit: One thing to take into account is that one of the genomes I'm using is mixed, thus having both chromosomes and scaffolds that have different prefix:

Florian411 commented 1 year ago

Thanks for the files! I managed to download. Thanks also for the hint regarding the genome file. I'll see what i can find and get back asap.

Florian411 commented 1 year ago

I just tried it with your example files and it works fine on my end.

bash TEPIC.sh -g ~/dataToTest/example.fasta -b ~/dataToTest/example2.narrowPeak -o TestFiles -p ../PWMs/2.1/JASPAR_PSEMs/JASPAR2018_CORE_vertebrates_non-redundant_pfms_transfac.PSEM -c 4 -a ~/dataToTest/first.gtf -t

is the command i used to obtain the attached annotation. I don't get anything using the second.gtf file. TestFiles_TEPIC_11_12_22_21_16_41_832407535_Decay_Peak_Features_Affinity_Gene_View_Filtered.txt

Have you tried with the example files? As it works with those my guess is that there might be a problematic entry somewhere deeper down in the gtf that causes the annotation script to break.

cardonaaaa commented 1 year ago

Well that's weird. I tried with the same files as you and I'm still getting the error. Could it be something related with my python version or sth similar? Anyway it should not be an issue with the GTF, since the one you used is actually the entire file (first.gtf)

Another thing I notice about your output is that gene IDs only contain a slice of the full ID, in this case "Nitab4", which is not enough to identify the locus. Might be because of the "." in the transcript IDs (Nitab4.5_XXXXX) (?)

I'm gonna keep trying to detect what is producing my error, will be back. Thanks for the help again!

Florian411 commented 1 year ago

I am on python version 3.8.10. Possible that it makes a difference.

Yes, good guess. The annotation only considers the part before the "."

Let me know what you find or if I can help somehow else.

cardonaaaa commented 1 year ago

Hi again, I managed to make it work re-downloading the GFF and transforming it to GTF. So it was probably an issue with the GTF being corrupted, as you predicted. Anyway I'm not able to identify my genes because the annotation considers only the part before "." Is it possible to solve that issue so it considers the full annotation? Maybe making it consider just the last "." of the annotation if there's more than one instance (?)

I'll try to remove the "." from the narrowPeak, FASTA and GTF meanwhile, and see how it goes. But this could be a good feature to improve so future users with the same issue do not need to modify their original files.