Dfam-consortium / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

gff3 file not recognized as so #161

Closed matryoskina closed 1 year ago

matryoskina commented 2 years ago

Hi,

I just masked my genome assembly and then used the script rmOutToGFF3.pl to convert the .out file to GFF3. The problem is that, whatever downstream software I use with this GFF, the file is not recognized as a proper GFF3 file. Looking at the Sequence Ontology rules, it seems that column 9 should have "ID" instead of "Target", and that attributes should be separated by ";" with no spaces. Also, column 3 (dispersed_repeat) does not seem to be recognized. In other words, it seems this is not a proper GFF3 file according to Sequence Ontology. I am trying to modify this GFF file, but I am not 100% sure how I should. For example, it would be nice if column 3 carried more information about the type of repeat, at least differentiating between LTR, DNA and simple repeats. Do you have any suggestions?

rmhubley commented 2 years ago

According to the GFF3 format:

Column 9: "attributes" A list of feature attributes in the format tag=value. Multiple tag=value pairs are separated by semicolons. URL escaping rules are used for tags or values containing the following characters: ",=;". Spaces are allowed in this field, but tabs must be replaced with the %09 URL escape. Attribute values do not need to be and should not be quoted. The quotes should be included as part of the value by parsers and not stripped.

These tags have predefined meanings:

Target: Indicates the target of a nucleotide-to-nucleotide or protein-to-nucleotide alignment. The format of the value is "target_id start end [strand]", where strand is optional and may be "+" or "-". If the target_id contains spaces, they must be escaped as hex escape %20.

The way I read that is: it is valid to have a value that contains spaces. In this case the tag is "Target" and the value is (for example) "AluY 1 311". The quotes are not included in the output file.
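To make that reading of the spec concrete, here is a minimal Python sketch of how a parser could handle column 9 under the rules quoted above. The function names `split_attributes` and `parse_target` are hypothetical, and this is not the code of any particular GFF3 parser; it only illustrates that a value containing spaces, such as "AluY 1 311", can be parsed without ambiguity.

```python
from urllib.parse import unquote


def split_attributes(column9):
    """Split a GFF3 column 9 string into {tag: raw_value}.

    Pairs are separated by ';' and each pair is split on the first '='.
    Values are left undecoded so that callers can split on literal spaces
    before resolving %XX escapes.
    """
    attributes = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"not a tag=value pair: {pair!r}")
        attributes[tag] = value
    return attributes


def parse_target(raw_value):
    """Parse a Target value of the form 'target_id start end [strand]'."""
    fields = raw_value.split(" ")
    if len(fields) not in (3, 4):
        raise ValueError(f"unexpected Target value: {raw_value!r}")
    # Spaces inside target_id arrive as %20, so decode only after splitting.
    target_id = unquote(fields[0])
    strand = fields[3] if len(fields) == 4 else None
    return target_id, int(fields[1]), int(fields[2]), strand


attrs = split_attributes("Target=AluY 1 311")
print(parse_target(attrs["Target"]))
```

A parser that instead splits column 9 on whitespace would break this value into three pieces, which is one plausible way a tool could end up rejecting the file.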

In addition, the Sequence Ontology term 'dispersed_repeat' is indeed in the current ontology: http://www.sequenceontology.org/browser/current_release/term/SO:0000658

I agree that it could be refined as they seem to have added a confusing subset of additional TE related terms such as 'long_terminal_repeat' and 'nested_repeat'

So...I suspect that your GFF parser is the problem. Could you point me to the software you are using to parse the files?

mars188 commented 2 years ago

Dear @rmhubley

I am facing the same issue as the OP has mentioned above. I ran the following command to generate gff file.

RepeatMasker -a -e rmblast -lib datepalm-families.fa -dir masker -gff -pa 36 datepalm_2.fasta

Here is what the header of this gff file looks like:

##gff-version 2
##date 2022-07-25
##sequence-region datepalm_2.fasta
scaffold_1 RepeatMasker similarity 1 2987 19.6 + . Target "Motif:A-rich" 1 2938
scaffold_1 RepeatMasker similarity 2988 3056 6.4 + . Target "Motif:(TAAATCC)n" 1 68
scaffold_1 RepeatMasker similarity 3057 3135 24.4 + . Target "Motif:A-rich" 1 76
scaffold_1 RepeatMasker similarity 3588 3671 27.1 + . Target "Motif:(CGGCGG)n" 1 84
scaffold_1 RepeatMasker similarity 3672 3858 12.6 - . Target "Motif:rnd-6_family-541" 3898 4097
scaffold_1 RepeatMasker similarity 4275 4327 11.3 - . Target "Motif:rnd-5_family-1742" 1 53
scaffold_1 RepeatMasker similarity 4679 4776 28.3 + . Target "Motif:GA-rich" 1 96

Up to this point it seems the *.gff file was generated successfully, but when I tried to convert this newly generated GFF into a GTF file, whatever downstream software I use (gffread, AGAT, etc.), the file is not recognized as a proper gff3 file.

I am not sure what's missing in it. Any help will be appreciated.

Many thanks,

mars188 commented 2 years ago

I even tried using the rmOutToGFF3.pl script to generate a gff file from the *.out files of the RepeatMasker run. It successfully generated the gff file, but again no software can convert it into GTF format.

matryoskina commented 2 years ago

Hi all,

What I did was to convert the file generated with rmOutToGFF3.pl to this:

[Image: Screenshot 2022-08-10 at 11 37 46]

mars188 commented 2 years ago

@matryoskina this is still gff right?

I tried the same and successfully got gff3 but still can not convert it to .gtf.

Any idea?

rmhubley commented 2 years ago

There is much inconsistency in the way that GFF is parsed, it seems. I ended up using this validator to debug this issue: http://genometools.org/cgi-bin/gff3validator.cgi I am not sure why, but many parsers believe that the "ID=" attribute of column 9 is a required attribute in GFF3. However, the spec doesn't indicate that it is: "The ID attribute is required for features that have children (e.g. gene and mRNAs), or for those that span multiple lines, but are optional for other features.". In the original output of rmOutToGFF3.pl I did not specify that these are genes or mRNAs, nor did I explicitly indicate that a feature spans multiple lines. However, RepeatMasker does join TE fragments that are part of an ancestral insertion, so I will kill two birds with one stone here: including the "ID=" attribute makes the parsers happy, and filling it with the RepeatMasker ID field joins the multiple lines of a fragmented insertion. NOTE: Some parsers act as if duplicate identifiers are a problem by stating things like "Warning: duplicate feature ID 24 (18070-18215) (discontinuous feature?)"; however, the format supports this and it should not cause any problems.
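As an illustration of the shared-ID convention described above, the following Python sketch collects the coordinate ranges of all lines that carry the same "ID=" value, which is how a consumer could reassemble the fragments of one insertion. It assumes tab-separated GFF3 with nine columns; the function name `group_by_id` and the example lines are made up for illustration and are not real RepeatMasker output.

```python
from collections import defaultdict


def group_by_id(gff3_lines):
    """Map each ID value to the list of (start, end) ranges that share it."""
    groups = defaultdict(list)
    for line in gff3_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        for pair in cols[8].split(";"):
            tag, _, value = pair.partition("=")
            if tag == "ID":
                groups[value].append((int(cols[3]), int(cols[4])))
                break
    return dict(groups)


# Hypothetical lines: two fragments of one insertion sharing ID=24.
example = [
    "##gff-version 3",
    "chr1\tRepeatMasker\tdispersed_repeat\t18070\t18215\t12.0\t+\t.\tID=24;Target=AluY 1 146",
    "chr1\tRepeatMasker\tdispersed_repeat\t18500\t18664\t12.0\t+\t.\tID=24;Target=AluY 147 311",
    "chr1\tRepeatMasker\tdispersed_repeat\t20000\t20100\t8.0\t-\t.\tID=25;Target=L2 1 101",
]
for feature_id, ranges in group_by_id(example).items():
    print(feature_id, ranges)
```

Under this reading, a repeated ID is the signal that the lines belong together, not an error.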

I don't think any other changes are necessary, but I can't test all parsers. If you still cannot use the output with the new version (in the development branch, but soon to be released), let me know (with specific instructions) where you got the parser and how you ran it. For those using the -gff option to RepeatMasker itself, please note that at this time its output is in GFF v2 format. If you need GFF v3, please use the utility "util/rmOutToGFF3.pl" to produce it from the .out file. In the next release I will be updating the RepeatMasker -gff option to also produce v3 format.
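To show how the two versions differ in column 9, here is a hedged Python sketch that rewrites the GFF v2 attribute style seen earlier in this thread (`Target "Motif:A-rich" 1 2938`) into a GFF3-style `Target=` pair. This is not the logic of rmOutToGFF3.pl; the function names and the regular expression are assumptions based only on the lines pasted above, and the sketch cannot add a meaningful "ID=" because the v2 lines do not carry the RepeatMasker ID.

```python
import re

# Assumed shape of the GFF v2 column 9 written by "RepeatMasker -gff".
_GFF2_TARGET = re.compile(
    r'^Target "Motif:(?P<name>.+)" (?P<start>\d+) (?P<end>\d+)$'
)

# Characters that need a URL-style escape inside a GFF3 Target id.
_ESCAPES = {
    "%": "%25", ";": "%3B", "=": "%3D", "&": "%26",
    ",": "%2C", " ": "%20", "\t": "%09",
}


def escape_target_id(name):
    """Escape reserved characters one by one (no double escaping)."""
    return "".join(_ESCAPES.get(ch, ch) for ch in name)


def gff2_target_to_gff3(column9):
    """Turn 'Target "Motif:NAME" START END' into 'Target=NAME START END'."""
    match = _GFF2_TARGET.match(column9.strip())
    if match is None:
        raise ValueError(f"unrecognized GFF v2 attribute: {column9!r}")
    name = escape_target_id(match["name"])
    return f"Target={name} {match['start']} {match['end']}"


print(gff2_target_to_gff3('Target "Motif:(TAAATCC)n" 1 68'))
```

The sketch only touches column 9. Running rmOutToGFF3.pl on the .out file remains the supported route to GFF3.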

mars188 commented 2 years ago

Dear @rmhubley, thanks a lot for the information. It has solved some of my issues and I was able to get a GTF file. However, when I use this GTF file for alignment with STAR, I get the following error:

Fatal INPUT FILE error, no exon lines in the GTF file: datepalm_refGene.gtf
Solution: check the formatting of the GTF file, it must contain some lines with exon in the 3rd column.
Make sure the GTF file is unzipped.
If exons are marked with a different word, use --sjdbGTFfeatureExon .

When I looked at the GTF, it contained only transcript and gene in the 3rd column and there was no exon listed there.
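A quick way to confirm what the third column actually contains is to tally the feature types. The Python sketch below does that for any tab-separated GFF/GTF text; the function name `count_feature_types` and the example lines are hypothetical and are not taken from datepalm_refGene.gtf.

```python
from collections import Counter


def count_feature_types(lines):
    """Count the values found in column 3 of tab-separated GFF/GTF lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3:
            counts[cols[2]] += 1
    return counts


# Made-up GTF lines that contain gene and transcript rows but no exon rows.
example_gtf = [
    '# example only',
    'scaffold_1\tRepeatMasker\tgene\t1\t2987\t.\t+\t.\tgene_id "g1";',
    'scaffold_1\tRepeatMasker\ttranscript\t1\t2987\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
]
print(count_feature_types(example_gtf))
```

With a real file, the same function can be called on an open file handle, for example `count_feature_types(open(path))`.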

How can I fix this formatting issue? Thanks

rmhubley commented 1 year ago

I am afraid that this may be a conversion issue that you need to bring up with the STAR authors. I am not familiar with the tool. While the output of rmOutToGFF3.pl may not be exactly what that tool wants it should be valid GFF data. If you find the issue or have a suggestion for how to make the GFF3 output more conformant please let me know.