enormandeau / go_enrichment

Transcripts annotation and GO enrichment Fisher tests
GNU General Public License v3.0

question about input files of step4 and step5 #6

Closed shiyi-pan closed 2 years ago

shiyi-pan commented 3 years ago

Hi, I want to use go_enrichment, but I have some questions about step4 and step5.

First, in step4, how can I get significant_ids.txt before I run the Fisher tests? Second, what are wanted_transcripts.ids and association.tsv? Third, in step3 I got an output file named sequence_annotation.txt. It does not seem to be used in step4 or step5.

Could you explain these points for me? Thank you very much.

enormandeau commented 3 years ago
  1. The GO enrichment compares a set of genes of interest to all the genes present in the transcriptome. These genes of interest are what the significant_ids.txt file refers to. They can be genes whose expression level differs between conditions and for which you want to know whether they are enriched for some GO terms.

  2. The wanted_transcripts.ids file contains one transcript name per line. The annotation.tsv file is the result of annotating the transcripts with the GO database. However, these names are not what I am expecting. If you posted the first 20 lines of each of the files it would help answer your questions.

  3. Same thing. Please post the first 20 lines of this file.
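
To make the comparison in point 1 concrete, here is a minimal sketch of a one-sided Fisher exact test for a single GO term, written with the Python standard library only. It is an illustration of the general technique, not the code used by go_enrichment. The function names, the `go_terms_by_id` mapping, and the assumption that the significant IDs are a subset of all IDs are hypothetical choices made for this example.

```python
from math import comb


def enrichment_p_value(n_all, n_all_with_term, n_sig, n_sig_with_term):
    """One-sided Fisher exact (hypergeometric) p-value for over-representation.

    n_all            -- number of genes in the whole transcriptome
    n_all_with_term  -- how many of those carry the GO term
    n_sig            -- number of genes of interest
    n_sig_with_term  -- how many genes of interest carry the GO term
    """
    max_k = min(n_all_with_term, n_sig)
    total = comb(n_all, n_sig)
    # Probability of seeing at least n_sig_with_term annotated genes
    # when drawing n_sig genes at random from the transcriptome.
    tail = sum(
        comb(n_all_with_term, k) * comb(n_all - n_all_with_term, n_sig - k)
        for k in range(n_sig_with_term, max_k + 1)
    )
    return tail / total


def term_p_value(term, significant_ids, all_ids, go_terms_by_id):
    """Count the four quantities from ID sets and test one GO term.

    Assumes significant_ids is a subset of all_ids (hypothetical input layout).
    """
    with_term = {i for i in all_ids if term in go_terms_by_id.get(i, ())}
    return enrichment_p_value(
        len(all_ids),
        len(with_term),
        len(significant_ids),
        len(significant_ids & with_term),
    )


if __name__ == "__main__":
    # Toy data: 10 transcripts, 3 of them annotated with the term.
    all_ids = {f"t{i}" for i in range(1, 11)}
    go_terms_by_id = {"t1": ["GO:0005634"], "t2": ["GO:0005634"], "t3": ["GO:0005634"]}
    significant_ids = {"t1", "t2", "t3"}
    print(term_p_value("GO:0005634", significant_ids, all_ids, go_terms_by_id))
```

In a real analysis one test is run per GO term, so the resulting p-values would normally need a multiple-testing correction.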

shiyi-pan commented 3 years ago

Thank you for your reply. Here is my step3 code:

$PATHON $SCRIPTS/03_annotate_genes.py $SEQUENCE_FILE $ANNOTATION_FOLDER sequence_annotation.txt

and I get a sequence_annotation.txt like this:

Name Accession Fullname Altnames Pfam GO CellularComponent Molecular Function Biological Process
NN01g00001.1 locus=Chr01:150058:150620:- Q15KI9 Q0WV21 Q15KJ0 Q9CAB1 Q9CAB2 Protein PHYLLO, chloroplastic PF00561;PF13378;PF02775;PF16582;PF02776; GO:0031969;GO:0016021;GO:0070204;GO:0070205;GO:0046872;GO:0043748;GO:0030976;GO:0009063;GO:0009234;GO:0042550;GO:0042372; C:chloroplast membrane; C:integral component of membrane; F:2-succinyl-5-enolpyruvyl-6-hydroxy-3-cyclohexene-1-carboxylic-acid synthase activity; F:2-succinyl-6-hydroxy-2,4-cyclohexadiene-1-carboxylate synthase activity; F:metal ion binding; F:O-succinylbenzoate synthase activity; F:thiamine pyrophosphate binding; P:cellular amino acid catabolic process; P:menaquinone biosynthetic process; P:photosystem I stabilization; P:phylloquinone biosynthetic process;
NN01g00002.1 locus=Chr01:238607:249703:+ Q9SZL8 Protein FAR1-RELATED SEQUENCE 5 PF03101;PF10551;PF04434; GO:0005634;GO:0008270;GO:0006355; C:nucleus; F:zinc ion binding; P:regulation of transcription, DNA-templated;
NN01g00003.1 locus=Chr01:258602:264467:- Q8GX93 O65486 Q93XN4 Q9SVX1 Chloride channel protein CLC-e CBS domain-containing protein CBSCLC3; PF00571;PF00654; GO:0034707;GO:0009535;GO:0005247;GO:0034765; C:chloride channel complex; C:chloroplast thylakoid membrane; F:voltage-gated chloride channel activity; P:regulation of ion transmembrane transport;

But in the following steps, as described in go_enrichment/01_scripts, the input files are significant_ids.txt, all_ids.txt, all_go_annotations.csv and go_enrichment.csv. As you said above, significant_ids.txt is the set of genes I am interested in and all_ids.txt is all the genes. What are all_go_annotations.csv and go_enrichment.csv? What is the relation between them and the step3 result file sequence_annotation.txt?
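
As a rough illustration of what the step3 output carries, the sketch below collects the GO identifiers listed for each transcript in a file shaped like the excerpt above. It is not part of go_enrichment and does not reproduce any file the later steps expect. The function name is hypothetical, and two things are assumed from the excerpt alone: that the first whitespace-delimited token of each record is the transcript name, and that GO identifiers always look like `GO:` followed by seven digits.

```python
import re

# GO identifiers as they appear in the excerpt, e.g. GO:0005634.
GO_ID = re.compile(r"GO:\d{7}")


def go_terms_by_transcript(lines):
    """Map each transcript name to the sorted GO identifiers on its line."""
    result = {}
    for line in lines:
        fields = line.split()
        # Skip blank lines and the header row (assumed to start with "Name").
        if not fields or fields[0] == "Name":
            continue
        result[fields[0]] = sorted(set(GO_ID.findall(line)))
    return result


if __name__ == "__main__":
    # File name taken from the step3 command above.
    with open("sequence_annotation.txt") as handle:
        for name, terms in go_terms_by_transcript(handle).items():
            print(name, ";".join(terms), sep="\t")
```

Transcripts with no GO identifiers on their line end up with an empty list, which makes unannotated sequences easy to spot.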

enormandeau commented 3 years ago

Here is how I run this step:

./01_scripts/03_annotate_genes.py 03_sequences/analyzed_genes.fasta \
    05_annotations/ all_annotations_transcripts.tsv

Here, all_annotations_transcripts.tsv is the name of the output file.