why so many 'unknown' sequences in generated TE lib

Yiguan commented 8 months ago

Hi,

I am testing EarlGrey on the reference genome of D. melanogaster (r6.42) with the following command:

earlGrey -g dmel-all-chromosome-r6.42.fasta \
    -s drosophilaMelanogaster \
    -o ./output \
    -t 32 \
    -r arthropoda \
    -c no \
    -m yes

There are 150 TE sequences identified in the library, which seems a reasonable number:

grep '>' drosophilaMelanogaster-families.fa.strained  | wc -l
# 150

But most of them (95/150) are labelled 'unknown':

grep '>' drosophilaMelanogaster-families.fa.strained | grep -i 'unknown' | wc -l
# 95
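
To see how the remaining sequences are distributed across TE classes, the same headers can be tallied by classification. This is only a sketch: it assumes the library uses RepeatModeler-style headers of the form `>name#Class/Family`, and `count_te_classes` is a hypothetical helper name, not part of EarlGrey.

```shell
# Tally TE classifications in a FASTA library.
# Assumption: headers look like ">name#Class/Family", optionally followed
# by whitespace and extra text. Headers without '#' are counted as-is.
count_te_classes() {
    grep '^>' "$1" \
        | sed -e 's/^[^#]*#//' -e 's/[[:space:]].*$//' \
        | sort | uniq -c | sort -k1,1nr -k2,2
}
```

For example, `count_te_classes drosophilaMelanogaster-families.fa.strained` prints one line per classification with its count, most frequent first.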

As D. melanogaster is a model species, I reckon it should not have so many 'unknown' sequences. Did I do something wrong? Or do you have any idea how to classify these unknown sequences?

Many thanks! EarlGrey is a very cool tool!

Yiguan

TobyBaril commented 7 months ago

Hi,

Which databases have you set up RepeatMasker with? This makes a big difference to the library classification, which is done using RepeatClassifier from RepeatModeler2 (so it uses exactly the same process as RepeatModeler2). I see you have also chosen to mask TEs using the arthropoda library first. For a model species, the recommendation is to mask known repeats ONLY from the species of interest, in this case using the Drosophila melanogaster libraries. Be aware that pre-masking with known repeats provides less information to RepeatModeler for de novo TE detection and classification, which can result in poorer consensus sequences. In this case, your de novo and unclassified TEs are in addition to the already-known TEs in the arthropoda library. For something like D. melanogaster, it is likely that there isn't much more to find, so the repetitive regions picked up by RepeatModeler might not be well resolved, and hence cannot be classified.

As mentioned in the latest manuscript, this tool is intended to improve on existing automated TE annotation methods. If TE annotation is central to your study, some level of manual curation is still likely to be required to reduce the level of false positives and to further refine TE classifications, where Earl Grey can help to reduce the burden on researchers by providing pre-extended consensus sequences.
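
As a starting point for that kind of manual curation, the unclassified consensus sequences can be pulled into their own FASTA file for inspection or for re-classification with another tool. This is a minimal sketch, again assuming RepeatModeler-style `>name#Class/Family` headers; `extract_unknown_families` is a hypothetical helper name, not an EarlGrey command.

```shell
# Print only the FASTA records whose header classification is "Unknown".
# Assumption: headers look like ">name#Unknown", optionally followed by
# whitespace and extra text. Sequence lines follow their header's fate.
extract_unknown_families() {
    awk '
        /^>/ { keep = ($0 ~ /#[Uu]nknown([ \t]|$)/) }
        keep { print }
    ' "$1"
}
```

For example, `extract_unknown_families drosophilaMelanogaster-families.fa.strained > unknown-families.fa` would write the unclassified families to a separate file (the output filename here is arbitrary).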

Yiguan commented 7 months ago

Thanks for the reply. Here is the RepeatMasker database information:

Using Master RepeatMasker Database: /data/home/ywang120/miniconda3/envs/earlgrey/share/RepeatMasker/Libraries/RepeatMaskerLib.h5
  Title    : Dfam
  Version  : 3.7
  Date     : 2023-01-11
  Families : 19,768

Species/Taxa Search:
  Arthropoda [NCBI Taxonomy ID: 6656]
  Lineage: root;cellular organisms;Eukaryota;Opisthokonta;Metazoa
  16 families in ancestor taxa; 6646 lineage-specific families

I think I am using Dfam 3.7. Should I use a more recent version?

TobyBaril / EarlGrey

why so many 'unknown' sequences in generated TE lib #90