Closed HeyLifeHD closed 6 years ago
Hey Joschka,
Thanks for your post. It seems there is a problem in reading your gtf file. Could you please send me that file via mail or link it here such that i can have a look at it.
Best, Florian
Hey Florian,
Thank you for your fast reply. I was already thinking that there could be some problems with the .gtf file, however I didn’t find any obvious mistakes. This is the link with the uploaded file: https://drive.google.com/open?id=1lSsoQkxk7wYBJRu2JGNWV7TpKRq4hCMc Thank you for your support Best,
Joschka
Hey again, Sure. I found the problem. When we wrote TEPIC we did not use it for transcript-specific annotation. In fact, I filter the gtf file for entries with the "gene" label in the third column. I will add a new option for the transcript analysis asap. I'll let you know when it is done. I have to make sure that it won't cause any problems anywhere else. Thanks for pointing it out! Florian P.S.: Please say "Hi" back to him :-)
Hey, Please pull the latest version and see if that fixed the issue. Please add to your command from above the flag -t
bash ../..//Code/TEPIC.sh -g /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/seq/mm10.fa -b /home/heyj/c010-datasets/Internal/COPD/ATAC//set1/processing/output_tg/peak/macs2/overlap/conservative_set/output_tg_rep1-rep2.naive_overlap.narrowPeak.bed -o /home/heyj/c010-datasets/Internal/COPD/TEPIC/ -p /home/heyj/TEPIC/PWMs/2.1/Merged_PSEMs/Merged_JASPAR_HOCOMOCO_KELLIS_Mus_musculus.PSEM -c 2 -a /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf -w 50000 -f /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf -e TRUE -j -t
I had to make one more change, in case you already pulled, please pull again.
Thank you for the update. I will try it today and let you know.
Hey Florian, I repeated the analysis with the updated TEPIC version and -t in the command. I obtained the following error:
bash ../..//Code/TEPIC.sh -g /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/seq/mm10.fa \
-b /home/heyj/c010-datasets/Internal/COPD/ATAC//set1/processing/output_tg/peak/macs2/overlap/conservative_set/output_tg_rep1-rep2.naive_overlap.narrowPeak.bed > -o /home/heyj/c010-datasets/Internal/COPD/TEPIC/ > -p /home/heyj/TEPIC/PWMs/2.1/Merged_PSEMs/Merged_JASPAR_HOCOMOCO_KELLIS_Mus_musculus.PSEM \ -c 2 > -a /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf -w 50000 > -f /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf -e TRUE -j -tPreprocessing region file: Removing chr prefix, sorting regions and removing duplicats Filter total peak setAdapting chr prefix in bed files for intersection with the reference genomeRunnig bedtools Converting invalid charactersStarting TRAP Filter regions that could not be annotatedGenerating gene scores Traceback (most recent call last): File "/home/heyj/TEPIC/Code/annotateTSS.py", line 700, in
main() File "/home/heyj/TEPIC/Code/annotateTSS.py", line 636, in main affinities,numberOfPeaks,peakLength=extractTF_Affinity(usedRegions,genesInOpenChromatin,args.affinity[0],tss,oC,decay,geneBody,addPeakFT,normaliseLength,motifLengths) File "/home/heyj/TEPIC/Code/annotateTSS.py", line 101, in extractTF_Affinity geneAffinities[geneID]=map(operator.div,numbers,map(lambda x: factor*(length-x+1) if (length-x+1 > 0) else 1, motifLength)) AttributeError: module 'operator' has no attribute 'div' Filter genes that could not be annotated Traceback (most recent call last): File "/home/heyj/TEPIC/Code/filterGeneView.py", line 27, in main() File "/home/heyj/TEPIC/Code/filterGeneView.py", line 16, in main infile=open(sys.argv[1],"r") FileNotFoundError: [Errno 2] No such file or directory: '/home/heyj/c010-datasets/Internal/COPD/TEPIC/_TEPIC_08_14_18_10_47_54_561824199_Decay_Peak_Features_Affinity_Gene_View.txt' rm: cannot remove '/home/heyj/c010-datasets/Internal/COPD/TEPIC/_TEPIC_08_14_18_10_47_54_561824199_Decay_Peak_Features_Affinity_Gene_View.txt': No such file or directory
In parallel I tried the analysis on gene level. For this I subsetted the .gtf file to contain only gene information. This time i obtained the following error message:
bash ../..//Code/TEPIC.sh -g /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/seq/mm10.fa -b /home/heyj/c010-datasets/Internal/COPD/ATAC//set1/processing/output_tg/peak/macs2/overlap/conservative_set/output_tg_rep1-rep2.naive_overlap.narrowPeak.bed -o /home/heyj/c010-datasets/Internal/COPD/TEPIC/GENE_MODEL//Affinities/group2/output_tg_rep1-rep2.naive_overlap.narrowPeak.bed -p /home/heyj/TEPIC/PWMs/2.1/Merged_PSEMs/Merged_JASPAR_HOCOMOCO_KELLIS_Mus_musculus.PSEM -c 2 -a /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/gencode.vM18.annotation_genes.gtf -w 50000 -f /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/gencode.vM18.annotation_genes.gtf -e TRUE -j
Preprocessing region file: Removing chr prefix, sorting regions and removing duplicats
Filter total peak set
Adapting chr prefix in bed files for intersection with the reference genome
Runnig bedtools
Converting invalid characters
Starting TRAP
Filter regions that could not be annotated
Generating gene scores
Traceback (most recent call last):
File "/home/heyj/TEPIC/Code/annotateTSS.py", line 700, in
Could you please provide me an example on how the gtf file should look like to make the analysis work. Thank you very much.
Best,
Joschka
Hey Joschka, Thanks for trying again. The first issue is due to a changed operator in python 3, I changed the operator encoding and it should work now. The second problem is really due to the gtf file but i handle that now in the gtf reading. You can find an example gtf file in the Test folder. Although i think the issues should be handled now. I tried all the test cases (they run fine now also with python 3). Have you tried to run the tests? Do they produce any errors on your system? Best, Florian
Hey Florian, I'm sorry but somehow it is still not running:
bash ../..//Code/TEPIC.sh -g /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/seq/mm10.fa \
-b /home/heyj/c010-datasets/Internal/COPD/ATAC//set1/processing/output_tg/peak/macs2/overlap/conservative_set/output_tg_rep1-rep2.naive_overlap.narrowPeak_sub.bed \ -o /home/heyj/c010-datasets/Internal/COPD/TEPIC/ \ -p /home/heyj/TEPIC/PWMs/2.1/Merged_PSEMs/Merged_JASPAR_HOCOMOCO_KELLIS_Mus_musculus.PSEM \ -c 2 \ -a /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf -w 50000 > -f /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf -e TRUE -j -t Preprocessing region file: Removing chr prefix, sorting regions and removing duplicats Filter total peak set Adapting chr prefix in bed files for intersection with the reference genome Runnig bedtools Converting invalid characters Starting TRAP ../..//Code/TEPIC.sh: line 342: /home/heyj/TEPIC/Code/TRAPmulti: No such file or directory Filter regions that could not be annotated Generating gene scoresNo TF affinities provided in /home/heyj/c010-datasets/Internal/COPD/TEPIC/_TEPIC_08_14_18_16_43_46_507373116_Affinity.txt.Filter genes that could not be annotated Traceback (most recent call last): File "/home/heyj/TEPIC/Code/filterGeneView.py", line 27, in
main() File "/home/heyj/TEPIC/Code/filterGeneView.py", line 16, in main infile=open(sys.argv[1],"r")FileNotFoundError: [Errno 2] No such file or directory: '/home/heyj/c010-datasets/Internal/COPD/TEPIC/_TEPIC_08_14_18_16_43_46_507373116_Decay_Peak_Features_Affinity_Gene_View.txt'rm: cannot remove '/home/heyj/c010-datasets/Internal/COPD/TEPIC/_TEPIC_08_14_18_16_43_46_507373116_Decay_Peak_Features_Affinity_Gene_View.txt': No such file or directory
The gtf file looks the following: chr1 unprocessed_pseudogene transcript 3054233 3054733 . + . gene_id "ENSMUSG00000090025"; transcript_id "ENSMUST00000160944"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene"; transcript_name "Gm16088-001"; transcript_source "havana"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF"; chr1 unprocessed_pseudogene exon 3054233 3054733 . + . gene_id "ENSMUSG00000090025"; transcript_id "ENSMUST00000160944"; exon_number "1"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene"; transcript_name "Gm16088-001"; transcript_source "havana"; exon_id "ENSMUSE00000848981"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF"
For the gene version i get:
bash ../..//Code/TEPIC.sh -g /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/seq/mm10.fa -b /home/heyj/c010-datasets/Internal/COPD/ATAC//set1/processing/output_tg/peak/macs2/overlap/conservative_set/output_tg_rep1-rep2.naive_overlap.narrowPeak.bed -o /home/heyj/c010-datasets/Internal/COPD/TEPIC/GENE_MODEL//Affinities/group2/output_tg_rep1-rep2.naive_overlap.narrowPeak.bed -p /home/heyj/TEPIC/PWMs/2.1/Merged_PSEMs/Merged_JASPAR_HOCOMOCO_KELLIS_Mus_musculus.PSEM -c 2 -a /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/gencode.vM18.annotation_genes.gtf -w 50000 -f /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/gencode.vM18.annotation_genes.gtf -e TRUE -j
Preprocessing region file: Removing chr prefix, sorting regions and removing duplicats
Filter total peak set
Adapting chr prefix in bed files for intersection with the reference genome
Runnig bedtools
Converting invalid characters
Starting TRAP
../..//Code/TEPIC.sh: line 342: /home/heyj/TEPIC/Code/TRAPmulti: No such file or directory
Filter regions that could not be annotated
Generating gene scores
Traceback (most recent call last):
File "/home/heyj/TEPIC/Code/annotateTSS.py", line 700, in
Attaching package: ‘gplots’
The following object is masked from ‘package:stats’:
lowess
Loading required package: glmnet Loading required package: Matrix Loading required package: foreach Loaded glmnet 2.0-16
Loading required package: iterators Loading required package: parallel [1] "Samples to analyse: 0" [1] "Performance measures:" Error in mean(TestAcc) : object 'TestAcc' not found Calls: print -> paste -> mean Execution halted
The gtf file for the genes looks like this: chr1 HAVANA gene 3073253 3074322 . + . ID=ENSMUSG00000102693.1;gene_id=ENSMUSG00000102693.1;gene_type=TEC;gene_name=RP23-271O17.1;level=2;havana_gene=OTTMUSG00000049935.1 chr1 ENSEMBL gene 3102016 3102125 . + . ID=ENSMUSG00000064842.1;gene_id=ENSMUSG00000064842.1;gene_type=snRNA;gene_name=Gm26206;level=3
Is it necessary to change to the exact layout of your gtf file? Thanks, Joschka
Hey Florian, it seems to be working now. I guess the problem was the missing space after the semicolons in my reference file. I replaced them with sed 's/;/; /g' and now it seems to be working. Thank you again for your support. Best, Joschka
Great! I am glad it works now. Thanks for your patience and feedback. You have my number now, feel free to call if sth. else pops up. Best, Florian
When I run TEPIC with the following code:
bash ../..//Code/TEPIC.sh -g /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/seq/mm10.fa \ -b /home/heyj/c010-datasets/Internal/COPD/ATAC//set1/processing/output_tg/peak/macs2/overlap/conservative_set/output_tg_rep1-rep2.naive_overlap.narrowPeak.bed \ -o /home/heyj/c010-datasets/Internal/COPD/TEPIC/ \ -p /home/heyj/TEPIC/PWMs/2.1/Merged_PSEMs/Merged_JASPAR_HOCOMOCO_KELLIS_Mus_musculus.PSEM \ -c 2 \ -a /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf -w 50000 \ -f /home/heyj/ngs_share/bcbio/genomes/Mmusculus/mm10/rnaseq/ref-transcripts.gtf -e TRUE -j
I keep on getting the following error message:
Filter regions that could not be annotated Generating gene scores Traceback (most recent call last): File "/home/heyj/TEPIC/Code/annotateTSS.py", line 689, in
main()
File "/home/heyj/TEPIC/Code/annotateTSS.py", line 564, in main
tss,identifier=readGTF(args.gtf[0])
File "/home/heyj/TEPIC/Code/annotateTSS.py", line 38, in readGTF
if (int(s[4]) > tss[s[9]][1]):
TypeError: '>' not supported between instances of 'int' and 'tuple'
Filter genes that could not be annotated
Traceback (most recent call last):
File "/home/heyj/TEPIC/Code/filterGeneView.py", line 27, in
main()
File "/home/heyj/TEPIC/Code/filterGeneView.py", line 16, in main
infile=open(sys.argv[1],"r")
FileNotFoundError: [Errno 2] No such file or directory: '/home/heyj/c010-datasets/Internal/COPD/TEPIC/_TEPIC_08_10_18_16_04_42_404135046_Decay_Peak_Features_Affinity_Gene_View.txt'
rm: cannot remove '/home/heyj/c010-datasets/Internal/COPD/TEPIC/_TEPIC_08_10_18_16_04_42_404135046_Decay_Peak_Features_Affinity_Gene_View.txt': No such file or directory
All my bed and reference files contain the "chr" prefix.
I would appreciate the help.
Best,
Joschka