BrendelGroup / AEGeAn

Integrated toolkit for analysis and evaluation of annotated genomes
http://brendelgroup.github.io/AEGeAn
ISC License
24 stars 10 forks source link

extracting intron sequences using "xtractore" option #151

Closed Veldem closed 9 years ago

Veldem commented 9 years ago

Dear Standage,

Many thanks for introducing AEGeAn tool to us for genom editing. Now, I am using "xtractore" option for extracting exon, intron and CDS. Exon and CDS sequences were seamlessly extracted using these codes:

bin/xtractore -t exon ../../Desktop/Oryza_sativa.IRGSP-1.0.29.gff3 ../../Desktop/Oryza_sativa.IRGSP-1.0.29.dna.toplevel.fa -o /home/Desktop/exon
bin/xtractore -t CDS ../../Desktop/Oryza_sativa.IRGSP-1.0.29.gff3 ../../Desktop/Oryza_sativa.IRGSP-1.0.29.dna.toplevel.fa -o /home/Desktop/CDS

But, it does not work for intronic sequences. Many thanks for your help -vahap-

standage commented 9 years ago

Without a link to the actual files you are working with, I cannot be sure what the issue is. However, I have a good guess. In most GFF3 files, intron features are not explicitly defined. Rather, the exon sequences are provided, and most scientists just understand that there is an intron between every pair of exons.

However, xtractore does not know this. So if you want xtractore to extract intron sequences, you have to provide a GFF3 file that has introns explicitly defined.

If your GFF3 file indeed has no intron features, I would recommend using the canon-gff3 program, also from AEGeAn, to create a new GFF3 file with these features. Please let me know if this solves your problem.

Veldem commented 9 years ago

Hi Standage, As you indicated, after taking a glance at it, my ".gff3" file has no intron features (Actually, the ".gff3 files" and their corresponding ".fa" files were downloaded from Ensemble FTP browser). Unfortunately, by using "canon-gff" option, ı could not create the new gff including intron features.

Usage: canon-gff3 [options] gff3file1 [gff3file2 ...]
  Options:
     -h|--help               print this help message and exit
     -i|--infer              for transcript features lacking an explicitly
                             declared gene feature as a parent, create this
                             feature on-they-fly
     -o|--outfile: STRING    name of file to which GFF3 data will be
                             written; default is terminal (stdout)
     -s|--source: STRING     reset the source of each feature to the given
                             value
     -v|--version            print version number and exit

Although I read the usage of it, still I could not get the new gff. If you give me an example usage how to use it, it will be great. Many thanks for all your help and quick response. -vahap-

standage commented 9 years ago

I'm not sure what is unclear about the instructions provided by the usage statement. Did you actually run canon-gff3 on a GFF3 file? What was the output? Did it print a bunch of GFF3 data to the terminal, or did it print a warning/error message? I won't be able to help much without more details.

Veldem commented 9 years ago

Hi Standage, Unfortunately, I could not get the ".gff3" file including introns using "canon-gff3" option. Here is the code:

home/Software/AEGeAn-0.14.1/bin/canon-gff3 Oryza_sativa.IRGSP-1.0.29.gff3 -o new_out

it gives an error list;

warning: found no valid mRNAs for gene 'gene1'
.
.
warning: found no valid mRNAs for gene 'gene35679'

and gives a non-sense output file. I thing my code or ".gff3" files is wrong! Yet, i can not solve the problem. Thank you again,

standage commented 9 years ago

Perhaps your GFF3 file uses transcript features instead of the more common canonical mRNA features? That would cause canon-gff3 to fail, and would produce the types of warning messages you're describing.

Try the following commands.

# Count the number of `mRNA` features
grep -c $'\tmRNA\t' Oryza_sativa.IRGSP-1.0.29.gff3

# Count the number of `transcript` features
grep -c $'\ttranscript\t' Oryza_sativa.IRGSP-1.0.29.gff3

If the number of mRNA features is 0 and the number of transcript features is greater than 0, then you need to convert the transcript features into mRNA features. The following command would work.

sed $'s:\ttranscript\t:\tmRNA\t:g' Oryza_sativa.IRGSP-1.0.29.gff3 > Oryza_sativa.mrnas.gff3
Veldem commented 9 years ago

Hi Stange, As you indicated, I changed transcript name with mRNA using your code, "canon-gff" option worked and successfully extracted intronic sequences. To verify the result, I also applied a python script to parse ".gff3" fies (script name: extract_intron_gff3_from_gff3.py, obtained from this link; https://github.com/irusri/Extract-intron-from-gff3, following this command; awk '/intron/{print}' out.gff3_introns.gff3 | sort -k 1,1 -k4,2n) Both of analysis gave approximately same results; 1) canon-gff result: 147,696 intronic sequence 2) extract_intron_gff3_from_gff3.py: 148,850 intronic sequence I think that I can use your tool, many thanks again for all your help ! -vahap-

standage commented 9 years ago

Great!