SchulzLab / TEPIC

Annotation of genomic regions using transcription factor binding sites and epigenetic data
MIT License
40 stars 9 forks source link

Error during gene annotation #26

Closed HeyLifeHD closed 6 years ago

HeyLifeHD commented 6 years ago

When I run TEPIC with the following code:

bash ../..//Code/TEPIC.sh -g /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/seq/mm10.fa \ -b /home/heyj/c010-datasets/Internal/COPD/ATAC//set1/processing/output_tg/peak/macs2/overlap/conservative_set/output_tg_rep1-rep2.naive_overlap.narrowPeak.bed \ -o /home/heyj/c010-datasets/Internal/COPD/TEPIC/ \ -p /home/heyj/TEPIC/PWMs/2.1/Merged_PSEMs/Merged_JASPAR_HOCOMOCO_KELLIS_Mus_musculus.PSEM \ -c 2 \ -a /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf -w 50000 \ -f /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf -e TRUE -j

I keep on getting the following error message:

Filter regions that could not be annotated Generating gene scores Traceback (most recent call last): File "/home/heyj/TEPIC/Code/annotateTSS.py", line 689, in main() File "/home/heyj/TEPIC/Code/annotateTSS.py", line 564, in main tss,identifier=readGTF(args.gtf[0]) File "/home/heyj/TEPIC/Code/annotateTSS.py", line 38, in readGTF if (int(s[4]) > tss[s[9]][1]): TypeError: '>' not supported between instances of 'int' and 'tuple' Filter genes that could not be annotated Traceback (most recent call last): File "/home/heyj/TEPIC/Code/filterGeneView.py", line 27, in main() File "/home/heyj/TEPIC/Code/filterGeneView.py", line 16, in main infile=open(sys.argv[1],"r") FileNotFoundError: [Errno 2] No such file or directory: '/home/heyj/c010-datasets/Internal/COPD/TEPIC/_TEPIC_08_10_18_16_04_42_404135046_Decay_Peak_Features_Affinity_Gene_View.txt' rm: cannot remove '/home/heyj/c010-datasets/Internal/COPD/TEPIC/_TEPIC_08_10_18_16_04_42_404135046_Decay_Peak_Features_Affinity_Gene_View.txt': No such file or directory

All my bed and reference files contain the "chr" prefix.

I would appreciate the help.

Best,

Joschka

Florian411 commented 6 years ago

Hey Joschka,

Thanks for your post. It seems there is a problem in reading your gtf file. Could you please send me that file via mail or link it here such that i can have a look at it.

Best, Florian

HeyLifeHD commented 6 years ago

Hey Florian,

Thank you for your fast reply. I was already thinking that there could be some problems with the .gtf file, however I didn’t find any obvious mistakes. This is the link with the uploaded file: https://drive.google.com/open?id=1lSsoQkxk7wYBJRu2JGNWV7TpKRq4hCMc Thank you for your support Best,

Joschka

Florian411 commented 6 years ago

Hey again, Sure. I found the problem. When we wrote TEPIC we did not use it for transcript-specific annotation. In fact, I filter the gtf file for entries with the "gene" label in the third column. I will add a new option for the transcript analysis asap. I'll let you know when it is done. I have to make sure that it won't cause any problems anywhere else. Thanks for pointing it out! Florian P.S.: Please say "Hi" back to him :-)

Florian411 commented 6 years ago

Hey, Please pull the latest version and see if that fixed the issue. Please add to your command from above the flag -t

bash ../..//Code/TEPIC.sh -g /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/seq/mm10.fa -b /home/heyj/c010-datasets/Internal/COPD/ATAC//set1/processing/output_tg/peak/macs2/overlap/conservative_set/output_tg_rep1-rep2.naive_overlap.narrowPeak.bed -o /home/heyj/c010-datasets/Internal/COPD/TEPIC/ -p /home/heyj/TEPIC/PWMs/2.1/Merged_PSEMs/Merged_JASPAR_HOCOMOCO_KELLIS_Mus_musculus.PSEM -c 2 -a /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf -w 50000 -f /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf -e TRUE -j -t

Florian411 commented 6 years ago

I had to make one more change, in case you already pulled, please pull again.

HeyLifeHD commented 6 years ago

Thank you for the update. I will try it today and let you know.

HeyLifeHD commented 6 years ago

Hey Florian, I repeated the analysis with the updated TEPIC version and -t in the command. I obtained the following error:

bash ../..//Code/TEPIC.sh -g /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/seq/mm10.fa \

-b /home/heyj/c010-datasets/Internal/COPD/ATAC//set1/processing/output_tg/peak/macs2/overlap/conservative_set/output_tg_rep1-rep2.naive_overlap.narrowPeak.bed > -o /home/heyj/c010-datasets/Internal/COPD/TEPIC/ > -p /home/heyj/TEPIC/PWMs/2.1/Merged_PSEMs/Merged_JASPAR_HOCOMOCO_KELLIS_Mus_musculus.PSEM \ -c 2 > -a /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf -w 50000 > -f /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf -e TRUE -j -tPreprocessing region file: Removing chr prefix, sorting regions and removing duplicats Filter total peak setAdapting chr prefix in bed files for intersection with the reference genomeRunnig bedtools Converting invalid charactersStarting TRAP Filter regions that could not be annotatedGenerating gene scores Traceback (most recent call last): File "/home/heyj/TEPIC/Code/annotateTSS.py", line 700, in main() File "/home/heyj/TEPIC/Code/annotateTSS.py", line 636, in main affinities,numberOfPeaks,peakLength=extractTF_Affinity(usedRegions,genesInOpenChromatin,args.affinity[0],tss,oC,decay,geneBody,addPeakFT,normaliseLength,motifLengths) File "/home/heyj/TEPIC/Code/annotateTSS.py", line 101, in extractTF_Affinity geneAffinities[geneID]=map(operator.div,numbers,map(lambda x: factor*(length-x+1) if (length-x+1 > 0) else 1, motifLength)) AttributeError: module 'operator' has no attribute 'div' Filter genes that could not be annotated Traceback (most recent call last): File "/home/heyj/TEPIC/Code/filterGeneView.py", line 27, in main() File "/home/heyj/TEPIC/Code/filterGeneView.py", line 16, in main infile=open(sys.argv[1],"r") FileNotFoundError: [Errno 2] No such file or directory: '/home/heyj/c010-datasets/Internal/COPD/TEPIC/_TEPIC_08_14_18_10_47_54_561824199_Decay_Peak_Features_Affinity_Gene_View.txt' rm: cannot remove '/home/heyj/c010-datasets/Internal/COPD/TEPIC/_TEPIC_08_14_18_10_47_54_561824199_Decay_Peak_Features_Affinity_Gene_View.txt': No such file or directory

In parallel I tried the analysis on gene level. For this I subsetted the .gtf file to contain only gene information. This time i obtained the following error message:

bash ../..//Code/TEPIC.sh -g /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/seq/mm10.fa -b /home/heyj/c010-datasets/Internal/COPD/ATAC//set1/processing/output_tg/peak/macs2/overlap/conservative_set/output_tg_rep1-rep2.naive_overlap.narrowPeak.bed -o /home/heyj/c010-datasets/Internal/COPD/TEPIC/GENE_MODEL//Affinities/group2/output_tg_rep1-rep2.naive_overlap.narrowPeak.bed -p /home/heyj/TEPIC/PWMs/2.1/Merged_PSEMs/Merged_JASPAR_HOCOMOCO_KELLIS_Mus_musculus.PSEM -c 2 -a /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/gencode.vM18.annotation_genes.gtf -w 50000 -f /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/gencode.vM18.annotation_genes.gtf -e TRUE -j Preprocessing region file: Removing chr prefix, sorting regions and removing duplicats Filter total peak set Adapting chr prefix in bed files for intersection with the reference genome Runnig bedtools Converting invalid characters Starting TRAP Filter regions that could not be annotated Generating gene scores Traceback (most recent call last): File "/home/heyj/TEPIC/Code/annotateTSS.py", line 700, in main() File "/home/heyj/TEPIC/Code/annotateTSS.py", line 575, in main tss,identifier=readGTF(args.gtf[0],transcriptAnnotation) File "/home/heyj/TEPIC/Code/annotateTSS.py", line 36, in readGTF if (s[idIndex] in tss): IndexError: list index out of range Filter genes that could not be annotated Traceback (most recent call last): File "/home/heyj/TEPIC/Code/filterGeneView.py", line 27, in main() File "/home/heyj/TEPIC/Code/filterGeneView.py", line 16, in main infile=open(sys.argv[1],"r") FileNotFoundError: [Errno 2] No such file or directory: '/home/heyj/c010-datasets/Internal/COPD/TEPIC/GENE_MODEL//Affinities/group2/output_tg_rep1-rep2.naive_overlap.narrowPeak.bed_TEPIC_08_14_18_11_32_48_957131277_Decay_Peak_Features_Affinity_Gene_View.txt' rm: cannot remove '/home/heyj/c010-datasets/Internal/COPD/TEPIC/GENE_MODEL//Affinities/group2/output_tg_rep1-rep2.naive_overlap.narrowPeak.bed_TEPIC_08_14_18_11_32_48_957131277_Decay_Peak_Features_Affinity_Gene_View.txt': No such file or director

Could you please provide me an example on how the gtf file should look like to make the analysis work. Thank you very much.

Best,

Joschka

Florian411 commented 6 years ago

Hey Joschka, Thanks for trying again. The first issue is due to a changed operator in python 3, I changed the operator encoding and it should work now. The second problem is really due to the gtf file but i handle that now in the gtf reading. You can find an example gtf file in the Test folder. Although i think the issues should be handled now. I tried all the test cases (they run fine now also with python 3). Have you tried to run the tests? Do they produce any errors on your system? Best, Florian

HeyLifeHD commented 6 years ago

Hey Florian, I'm sorry but somehow it is still not running:

bash ../..//Code/TEPIC.sh -g /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/seq/mm10.fa \

-b /home/heyj/c010-datasets/Internal/COPD/ATAC//set1/processing/output_tg/peak/macs2/overlap/conservative_set/output_tg_rep1-rep2.naive_overlap.narrowPeak_sub.bed \ -o /home/heyj/c010-datasets/Internal/COPD/TEPIC/ \ -p /home/heyj/TEPIC/PWMs/2.1/Merged_PSEMs/Merged_JASPAR_HOCOMOCO_KELLIS_Mus_musculus.PSEM \ -c 2 \ -a /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf -w 50000 > -f /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf -e TRUE -j -t Preprocessing region file: Removing chr prefix, sorting regions and removing duplicats Filter total peak set Adapting chr prefix in bed files for intersection with the reference genome Runnig bedtools Converting invalid characters Starting TRAP ../..//Code/TEPIC.sh: line 342: /home/heyj/TEPIC/Code/TRAPmulti: No such file or directory Filter regions that could not be annotated Generating gene scoresNo TF affinities provided in /home/heyj/c010-datasets/Internal/COPD/TEPIC/_TEPIC_08_14_18_16_43_46_507373116_Affinity.txt.Filter genes that could not be annotated Traceback (most recent call last): File "/home/heyj/TEPIC/Code/filterGeneView.py", line 27, in main() File "/home/heyj/TEPIC/Code/filterGeneView.py", line 16, in main infile=open(sys.argv[1],"r")FileNotFoundError: [Errno 2] No such file or directory: '/home/heyj/c010-datasets/Internal/COPD/TEPIC/_TEPIC_08_14_18_16_43_46_507373116_Decay_Peak_Features_Affinity_Gene_View.txt'rm: cannot remove '/home/heyj/c010-datasets/Internal/COPD/TEPIC/_TEPIC_08_14_18_16_43_46_507373116_Decay_Peak_Features_Affinity_Gene_View.txt': No such file or directory

The gtf file looks the following: chr1 unprocessed_pseudogene transcript 3054233 3054733 . + . gene_id "ENSMUSG00000090025"; transcript_id "ENSMUST00000160944"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene"; transcript_name "Gm16088-001"; transcript_source "havana"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF"; chr1 unprocessed_pseudogene exon 3054233 3054733 . + . gene_id "ENSMUSG00000090025"; transcript_id "ENSMUST00000160944"; exon_number "1"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene"; transcript_name "Gm16088-001"; transcript_source "havana"; exon_id "ENSMUSE00000848981"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF"

For the gene version i get: bash ../..//Code/TEPIC.sh -g /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/seq/mm10.fa -b /home/heyj/c010-datasets/Internal/COPD/ATAC//set1/processing/output_tg/peak/macs2/overlap/conservative_set/output_tg_rep1-rep2.naive_overlap.narrowPeak.bed -o /home/heyj/c010-datasets/Internal/COPD/TEPIC/GENE_MODEL//Affinities/group2/output_tg_rep1-rep2.naive_overlap.narrowPeak.bed -p /home/heyj/TEPIC/PWMs/2.1/Merged_PSEMs/Merged_JASPAR_HOCOMOCO_KELLIS_Mus_musculus.PSEM -c 2 -a /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/gencode.vM18.annotation_genes.gtf -w 50000 -f /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/gencode.vM18.annotation_genes.gtf -e TRUE -j Preprocessing region file: Removing chr prefix, sorting regions and removing duplicats Filter total peak set Adapting chr prefix in bed files for intersection with the reference genome Runnig bedtools Converting invalid characters Starting TRAP ../..//Code/TEPIC.sh: line 342: /home/heyj/TEPIC/Code/TRAPmulti: No such file or directory Filter regions that could not be annotated Generating gene scores Traceback (most recent call last): File "/home/heyj/TEPIC/Code/annotateTSS.py", line 700, in main() File "/home/heyj/TEPIC/Code/annotateTSS.py", line 575, in main tss,identifier=readGTF(args.gtf[0],transcriptAnnotation) File "/home/heyj/TEPIC/Code/annotateTSS.py", line 36, in readGTF if (s[idIndex] in tss): IndexError: list index out of range Filter genes that could not be annotated Traceback (most recent call last): File "/home/heyj/TEPIC/Code/filterGeneView.py", line 27, in main() File "/home/heyj/TEPIC/Code/filterGeneView.py", line 16, in main infile=open(sys.argv[1],"r") FileNotFoundError: [Errno 2] No such file or directory: '/home/heyj/c010-datasets/Internal/COPD/TEPIC/GENE_MODEL//Affinities/group2/output_tg_rep1-rep2.naive_overlap.narrowPeak.bed_TEPIC_08_14_18_16_43_33_060433770_Decay_Peak_Features_Affinity_Gene_View.txt' rm: cannot remove '/home/heyj/c010-datasets/Internal/COPD/TEPIC/GENE_MODEL//Affinities/group2/output_tg_rep1-rep2.naive_overlap.narrowPeak.bed_TEPIC_08_14_18_16_43_33_060433770_Decay_Peak_Features_Affinity_Gene_View.txt': No such file or directory Postprocessing TF affinities python Scripts/computeMeanRatioTFAffinities.py /home/heyj/c010-datasets/Internal/COPD/TEPIC/GENE_MODEL//Affinities/group1/ /home/heyj/c010-datasets/Internal/COPD/TEPIC/GENE_MODEL//Affinities/group2/ /home/heyj/c010-datasets/Internal/COPD/TEPIC/GENE_MODEL//Affinities/Mean/Mean_Affinities_group1.txt /home/heyj/c010-datasets/Internal/COPD/TEPIC/GENE_MODEL//Affinities/Mean/Mean_Affinities_group2.txt /home/heyj/c010-datasets/Internal/COPD/TEPIC/GENE_MODEL//Affinities/Ratio/Ratio_Affinities_group1_vs_group2.txt False TRUE There are no genes available Combining TF affinities with differential gene expression data python Scripts/integrateData.py /home/heyj/c010-datasets/Internal/COPD/TEPIC/GENE_MODEL//Affinities/Ratio/Ratio_Affinities_group1_vs_group2.txt /home/heyj/c010-datasets/Internal/COPD/RNASeq/GeneList.txt /home/heyj/c010-datasets/Internal/COPD/TEPIC/GENE_MODEL//IntegratedData/Log2/Integrated_Data_Log2_Quotient.txt Reading TF affinities from file: /home/heyj/c010-datasets/Internal/COPD/TEPIC/GENE_MODEL//Affinities/Ratio/Ratio_Affinities_group1_vs_group2.txt Traceback (most recent call last): File "Scripts/integrateData.py", line 75, in main() File "Scripts/integrateData.py", line 25, in main tfFile=open(args.affinities[0],"r") FileNotFoundError: [Errno 2] No such file or directory: '/home/heyj/c010-datasets/Internal/COPD/TEPIC/GENE_MODEL//Affinities/Ratio/Ratio_Affinities_group1_vs_group2.txt' Rscript Scripts/prepareForClassificiation.R /home/heyj/c010-datasets/Internal/COPD/TEPIC/GENE_MODEL//IntegratedData/Log2/Integrated_Data_Log2_Quotient.txt /home/heyj/c010-datasets/Internal/COPD/TEPIC/GENE_MODEL//IntegratedData/Binary/Integrated_Data_For_Classification.txt Error in file(file, "rt") : cannot open the connection Calls: read.table -> file In addition: Warning message: In file(file, "rt") : cannot open file '/home/heyj/c010-datasets/Internal/COPD/TEPIC/GENE_MODEL//IntegratedData/Log2/Integrated_Data_Log2_Quotient.txt': No such file or directory Execution halted Starting differential learning. Please wait... Rscript Scripts/DYNAMITE.R --dataDir=/home/heyj/c010-datasets/Internal/COPD/TEPIC/GENE_MODEL//IntegratedData/Binary/ --outDir=/home/heyj/c010-datasets/Internal/COPD/TEPIC/GENE_MODEL//Learning_Results/ --out_var=Expression --Ofolds=3 --Ifolds=6 --performance=TRUE --alpha=0.1 --cores=4 Loading required package: ggplot2 Loading required package: gplots

Attaching package: ‘gplots’

The following object is masked from ‘package:stats’:

lowess

Loading required package: glmnet Loading required package: Matrix Loading required package: foreach Loaded glmnet 2.0-16

Loading required package: iterators Loading required package: parallel [1] "Samples to analyse: 0" [1] "Performance measures:" Error in mean(TestAcc) : object 'TestAcc' not found Calls: print -> paste -> mean Execution halted

The gtf file for the genes looks like this: chr1 HAVANA gene 3073253 3074322 . + . ID=ENSMUSG00000102693.1;gene_id=ENSMUSG00000102693.1;gene_type=TEC;gene_name=RP23-271O17.1;level=2;havana_gene=OTTMUSG00000049935.1 chr1 ENSEMBL gene 3102016 3102125 . + . ID=ENSMUSG00000064842.1;gene_id=ENSMUSG00000064842.1;gene_type=snRNA;gene_name=Gm26206;level=3

Is it necessary to change to the exact layout of your gtf file? Thanks, Joschka

HeyLifeHD commented 6 years ago

Hey Florian, it seems to be working now. I guess the problem was the missing space after the semicolons in my reference file. I replaced them with sed 's/;/; /g' and now it seems to be working. Thank you again for your support. Best, Joschka

Florian411 commented 6 years ago

Great! I am glad it works now. Thanks for your patience and feedback. You have my number now, feel free to call if sth. else pops up. Best, Florian