Closed by shiyi-pan 2 years ago
The GO enrichment compares a set of genes of interest to all the genes present in the transcriptome. These genes of interest are what the significant_ids.txt file refers to. They can be genes whose expression level differs between conditions and for which you want to know whether they are enriched for some GO terms.
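To make the comparison concrete, here is a minimal sketch of a one-sided Fisher exact test for a single GO term, written with the standard library only. The function name and its parameters are hypothetical and are not taken from the go_enrichment scripts, which may compute the test differently and may also correct for multiple testing.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher exact (hypergeometric) p-value, P(X >= k).

    N: number of genes in the transcriptome (all IDs)
    K: number of those genes annotated with the GO term
    n: number of genes of interest (significant IDs)
    k: number of genes of interest annotated with the GO term

    Hypothetical helper for illustration only.
    """
    total = comb(N, n)  # ways to draw n genes out of N
    upper = min(n, K)   # largest possible overlap
    # Sum the probabilities of seeing k or more annotated genes by chance.
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Toy numbers, not real data: 10 genes, 3 carry the term,
# 4 genes of interest, 2 of which carry the term.
p = enrichment_p_value(k=2, n=4, K=3, N=10)
print(p)
```

A small p-value suggests the term appears among the genes of interest more often than expected from its frequency in the whole transcriptome.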
The wanted_transcripts.ids file contains one transcript name per line. The annotation.tsv file is the result of annotating the transcripts with the GO database. However, these names are not what I am expecting. If you post the first 20 lines of each of these files, it will help me answer your questions.
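As a quick sanity check on ID files with one name per line, a sketch like the following can confirm that every ID of interest also appears in the full list. The helper names are hypothetical, and the inline sample IDs stand in for the real file contents.

```python
def read_ids(lines):
    """Return the non-empty, whitespace-stripped IDs, one per input line."""
    return [line.strip() for line in lines if line.strip()]


def missing_ids(wanted, all_ids):
    """Return the wanted IDs that are absent from the full ID list."""
    known = set(all_ids)
    return [name for name in wanted if name not in known]


# With real files, pass an open file handle instead of these sample lists,
# e.g. read_ids(open("significant_ids.txt")).
all_ids = read_ids(["NN01g00001.1\n", "NN01g00002.1\n", "NN01g00003.1\n"])
wanted = read_ids(["NN01g00002.1\n", "\n", "NN01g99999.1\n"])
print(missing_ids(wanted, all_ids))
```

Any ID reported as missing usually points to a naming mismatch between the two files, for example a transcript suffix present in one file but not in the other.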
Same thing. Please post the first 20 lines of this file.
Thank you for your reply. Here is my step 3 code:
$PATHON $SCRIPTS/03_annotate_genes.py $SEQUENCE_FILE $ANNOTATION_FOLDER sequence_annotation.txt
and I get a sequence_annotation.txt like this:
Name Accession Fullname Altnames Pfam GO CellularComponent Molecular Function Biological Process
NN01g00001.1 locus=Chr01:150058:150620:- Q15KI9 Q0WV21 Q15KJ0 Q9CAB1 Q9CAB2 Protein PHYLLO, chloroplastic PF00561;PF13378;PF02775;PF16582;PF02776; GO:0031969;GO:0016021;GO:0070204;GO:0070205;GO:0046872;GO:0043748;GO:0030976;GO:0009063;GO:0009234;GO:0042550;GO:0042372; C:chloroplast membrane; C:integral component of membrane; F:2-succinyl-5-enolpyruvyl-6-hydroxy-3-cyclohexene-1-carboxylic-acid synthase activity; F:2-succinyl-6-hydroxy-2,4-cyclohexadiene-1-carboxylate synthase activity; F:metal ion binding; F:O-succinylbenzoate synthase activity; F:thiamine pyrophosphate binding; P:cellular amino acid catabolic process; P:menaquinone biosynthetic process; P:photosystem I stabilization; P:phylloquinone biosynthetic process;
NN01g00002.1 locus=Chr01:238607:249703:+ Q9SZL8 Protein FAR1-RELATED SEQUENCE 5 PF03101;PF10551;PF04434; GO:0005634;GO:0008270;GO:0006355; C:nucleus; F:zinc ion binding; P:regulation of transcription, DNA-templated;
NN01g00003.1 locus=Chr01:258602:264467:- Q8GX93 O65486 Q93XN4 Q9SVX1 Chloride channel protein CLC-e CBS domain-containing protein CBSCLC3; PF00571;PF00654; GO:0034707;GO:0009535;GO:0005247;GO:0034765; C:chloride channel complex; C:chloroplast thylakoid membrane; F:voltage-gated chloride channel activity; P:regulation of ion transmembrane transport;
But in the following steps, as you described in go_enrichment/01_scripts, the input files are significant_ids.txt, all_ids.txt, all_go_annotations.csv and go_enrichment.csv. As you said above, significant_ids.txt is the set of genes I am interested in and all_ids.txt is all the genes. What are all_go_annotations.csv and go_enrichment.csv? What is the relation between them and the step 3 result file sequence_annotation.txt?
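One way to see how GO terms sit inside the step 3 output is to pull gene-to-GO pairs out of each row. The sketch below is an illustration only: it assumes the gene name is the first whitespace-separated token of a row and that GO identifiers look like `GO:` followed by seven digits, as in the rows quoted in this thread. It does not claim to reproduce the format of all_go_annotations.csv, which this thread does not describe.

```python
import re

# GO identifiers as they appear in the quoted rows, e.g. GO:0005634.
GO_ID = re.compile(r"GO:\d{7}")


def gene_go_pairs(lines):
    """Return (gene, GO id) pairs found in annotation rows.

    Hypothetical helper: takes the first token of each row as the gene
    name and collects every GO identifier on that row.
    """
    pairs = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        gene = fields[0]
        for go_id in GO_ID.findall(line):
            pairs.append((gene, go_id))
    return pairs


# Header plus one row copied from the output quoted in this thread.
rows = [
    "Name Accession Fullname Altnames Pfam GO CellularComponent",
    "NN01g00002.1 locus=Chr01:238607:249703:+ Q9SZL8 Protein FAR1-RELATED "
    "SEQUENCE 5 PF03101;PF10551;PF04434; GO:0005634;GO:0008270;GO:0006355; "
    "C:nucleus; F:zinc ion binding;",
]
for gene, go_id in gene_go_pairs(rows):
    print(gene, go_id)
```

The header row yields no pairs because it contains no seven-digit GO identifier.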
Here is how I run this step:
./01_scripts/03_annotate_genes.py 03_sequences/analyzed_genes.fasta \
05_annotations/ all_annotations_transcripts.tsv
The all_annotations_transcripts.tsv file is the output file name.
Hi, I want to use go_enrichment, but I have some questions about step 4 and step 5.
First, in step 4, how do I get significant_ids.txt before I do the Fisher tests? Second, what are wanted_transcripts.ids and association.tsv? Third, at step 3 I got an output file named sequence_annotation.txt. It does not seem to be used in step 4 or step 5.
Could you explain these points for me? Thank you very much.