TobyBaril / EarlGrey

Earl Grey: A fully automated TE curation and annotation pipeline

Impact of initial masking by RM and usage of DFAM3.8 #107

Closed estolle closed 5 months ago

estolle commented 6 months ago

Hi,

I've been trying to run the newest EG version with RM4.16 and DFAM3.8 (Hymenoptera subset).

I ran into a small problem in the earlgrey executable where the RM library is specified (before famdb.py is called). I modified the else statement to point at the famdb folder as the location of the libraries; otherwise famdb.py throws an error:

       <<< Getting RepeatMasker Sequences for Hymenoptera and Saving as Fasta >>>
ERROR:famdb_globals:Please specify a directory to operate on with the -i/--db_dir option.
famdb_globals --db_dir
famdb.py -i DB_DIR 
libpath="$(which RepeatMasker | sed 's|bin.*|share/RepeatMasker/Libraries/famdb|g')"
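
The path substitution in the last line can be tried in isolation. The following is a minimal sketch, assuming a conda-style layout where the RepeatMasker executable sits under `bin/` and the famdb files under `share/RepeatMasker/Libraries/famdb`; the function name `famdb_dir_from_rm` and the example path are hypothetical.

```shell
#!/usr/bin/env bash
# Derive the famdb directory from the path of the RepeatMasker executable.
# Assumption: conda-style layout, i.e. <prefix>/bin/RepeatMasker and
# <prefix>/share/RepeatMasker/Libraries/famdb.
famdb_dir_from_rm() {
    # $1: full path to the RepeatMasker executable
    printf '%s\n' "$1" | sed 's|bin.*|share/RepeatMasker/Libraries/famdb|'
}

# Hypothetical example path; in practice one would pass "$(which RepeatMasker)".
libpath="$(famdb_dir_from_rm /opt/conda/envs/earlgrey/bin/RepeatMasker)"
echo "$libpath"
```

Note that the pattern matches the first occurrence of "bin" anywhere in the path, so a prefix that itself contains "bin" would be cut too early; matching on `/bin/` would be a more defensive choice.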

I also added -uncurated to the RepeatMasker call (used when a species/taxon is specified, in my case "Hymenoptera"):

firstMask()
{
        cd ${OUTDIR}/${species}_RepeatMasker
        rmthreads=$(expr $ProcNum / 4)
        RepeatMasker -species $RepSpec -norna -lcambig -uncurated -s -a -pa $rmthreads -dir $OUTDIR/${species}_RepeatMasker $genome

I actually want to avoid generating the .lib file from scratch every time, because I run many genomes with the same "Hymenoptera" subset. I tried to softlink the .lib file but got an error later on. I am now going to try running my initial library with the -l flag (custom consensus library) instead of the -r flag (species search term), because I added some more sequences to the .lib file:

                -r == RepeatMasker search term (e.g arthropoda/eukarya)
                -l == Starting consensus library for an inital mask (in fasta format)

But I was wondering anyway: how much of an impact does this initial masking/RM scan have?

And how can I improve the classification? Are similarities to the RepeatMasker.lib sequences used, with the repeat ID then taken from the header information of the matching sequence?

Thanks a lot

TobyBaril commented 6 months ago

Hi,

Thanks for the information - currently Earl Grey is configured for Dfam 3.7 as standard (due to the large and split database size in 3.8 making it impossible to package in a conda release). The developers are looking at fixing this for the next Dfam release, so we should be able to resolve this at some point soon. There is also a container with Dfam 3.8 root partition already installed and configured if this is of use (extra partitions can be quickly added to the container, then it can be committed as a local image).

The initial masking can significantly impact the quality of the de novo TE library generated for each species, as it gives less information to the de novo run than an unmasked genome. For example, you may mask something as a hAT because there is sufficient similarity to something known in the database, when in fact it is a different family that should be resolved de novo. With fewer unmasked copies, or fragments, left in the genome, the quality of the sequences available to generate a novel consensus decreases.

Further, this can also impact TE divergence estimates, as the consensus existing in the library is likely from a more distal species than the one being analysed. Therefore, intact elements can differ from the consensus, and so skew the age distribution towards elements looking older than they would do if they were annotated with a consensus generated from the species of interest.

"Improving" classification is not a simple problem. Currently, we use RepeatClassifier as the de facto standard, as employed in RepeatModeler2. This looks for similarity to curated elements and then to known TE-associated protein domains. Outside of this, the only way to improve classification is to manually curate elements (something like MCHelper is good for this), or to increase sampling and curation efforts across eukaryotes to improve database quality. As a note here, anything uncurated has not been checked for classification accuracy at all. There are many multi-copy host genes in the uncurated section of Dfam (e.g. venom genes, olfactory receptors, detoxification genes, etc.) because RepeatModeler picks these up as repeated units in the all-by-all genome alignment, so I would not suggest including uncurated elements unless you are aware of the impact that this will have on your annotation quality.

TobyBaril commented 6 months ago

As a note, I have also updated the Dfam 3.8 docker container to prevent replication of the same library multiple times. You may be able to modify your installation to do the same. I added/modified the following on lines 94-98:

if [ ! -f /usr/local/share/earl_libraries/${RepSpec}.RepeatMasker.lib ]; then
    mkdir -p /usr/local/share/earl_libraries/
    famdb.py -i $libpath families -f fasta_name --include-class-in-name -a -d $RepSpec > /usr/local/share/earl_libraries/${RepSpec}.RepeatMasker.lib
fi
RepSub=/usr/local/share/earl_libraries/${RepSpec}.RepeatMasker.lib
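
The same generate-once pattern can be sketched outside the container. This is only an illustration: `EARL_LIB_DIR` is a hypothetical variable (not an Earl Grey setting), and `generate_lib` is a stand-in for the famdb.py export above so that the sketch runs without a Dfam installation.

```shell
#!/usr/bin/env bash
# Hypothetical cache location, not an Earl Grey setting. The default falls
# back to a temporary directory; point it at a persistent one for real use.
EARL_LIB_DIR="${EARL_LIB_DIR:-${TMPDIR:-/tmp}/earl_libraries}"

# Stand-in for the famdb.py export, so the sketch runs without Dfam.
generate_lib() {
    printf '>placeholder#Unknown\nACGT\n'
}

# Print the path of the cached library for a taxon, generating it only
# when no non-empty file exists yet.
cached_lib() {
    local taxon="$1"
    local lib="${EARL_LIB_DIR}/${taxon}.RepeatMasker.lib"
    if [ ! -s "$lib" ]; then
        mkdir -p "$EARL_LIB_DIR"
        # Write to a temporary file first so an interrupted run does not
        # leave a truncated library behind.
        generate_lib "$taxon" > "${lib}.tmp.$$" && mv "${lib}.tmp.$$" "$lib"
    fi
    printf '%s\n' "$lib"
}

RepSub="$(cached_lib Hymenoptera)"
echo "$RepSub"
```

Testing with `-s` rather than `-f` means an empty file left over from a failed export is regenerated instead of being reused.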

I am hesitant to default the search to include uncurated elements due to the issues this can create (as above) and also as general users may not be aware of the drop in annotation quality by including these.

estolle commented 6 months ago

Thanks a lot for the detailed explanation! It's very helpful for deciding how best to analyze our genomes.

From what I understood now, wouldn't it make sense to avoid an initial masking altogether then?

I have seen the MCHelper pipeline as well. Its concept of consensus extraction etc. seems quite similar, so I didn't think it would be helpful to run MCHelper again on the Earl Grey TE library. But I'll try the classification module. I am trying to find out how RepeatClassifier within RepeatModeler2 works, and whether it is based on the repeat library that is configured with RepeatMasker. In that case it may still be useful to use the uncurated DFAM library for the insect order (I would then attempt to remove some common host genes from it).

Another question I had earlier, and you already noted the gene families (host genes) present in DFAM: if EG picks up host genes in the TE library, is there an "easy" way to filter them out (e.g. based on a nucleotide/amino acid multi-FASTA containing sequences of repetitive gene families)? I have a couple of typical candidates in insects, which ideally would be filtered somewhere within the EG pipeline.

Thanks for your advice!

TobyBaril commented 6 months ago

Yes, it makes sense to completely avoid an initial masking unless you have an existing library for (ideally) the same species, or a very closely related one - in most non-model cases this doesn't exist!

RepeatClassifier runs a few steps. It looks for similarity to known TE families in your RepeatMasker library, with the caveat that it will only look for similarity to curated families and will ignore uncurated ones. As far as I know, at the moment there is no way to get RepeatClassifier to include uncurated families in the classification step, which I think generally makes sense to ensure the classifications have a higher chance of being correct. It then looks for similarity to known TE protein domains, which can be found in the RepeatMasker installation, and should be named something like RepeatPeps...

The easiest way to filter out candidate host genes is to take the TE library (.strained file), run a BLASTX or HMMER search against your candidate genes, and filter out anything with reasonable matches (this is generally easier to do manually, but a threshold on something like alignment coverage plus percent identity would work). We are working on incorporating such a search into TEstrainer, but this is still a work in progress for the moment.
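
The filtering step described above could look roughly like the following. This is a sketch under stated assumptions, not part of Earl Grey or TEstrainer: it assumes a BLASTX run with a custom tabular format against a protein database of candidate host genes, and the thresholds (40% identity, alignment covering half of the host protein) are arbitrary placeholders to be tuned. All file and function names are hypothetical, and the example data is made up.

```shell
#!/usr/bin/env bash
# Assumed upstream step (not run here; file names are placeholders):
#   makeblastdb -in host_genes.faa -dbtype prot
#   blastx -query library.fa -db host_genes.faa \
#          -outfmt "6 qseqid sseqid pident length qlen slen" -out hits.tsv

# List query IDs with at least one hit passing both thresholds.
# $1: tabular hits, $2: min percent identity, $3: min fraction of subject covered
host_gene_hits() {
    awk -F'\t' -v pid="$2" -v cov="$3" \
        '$3 >= pid && ($4 / $6) >= cov { print $1 }' "$1" | sort -u
}

# Drop FASTA records whose ID (header up to the first whitespace) is listed.
# $1: file of IDs (one per line), $2: FASTA library
drop_records() {
    awk 'FILENAME == ARGV[1] { bad[$1] = 1; next }
         /^>/ { id = substr($1, 2); keep = !(id in bad) }
         keep' "$1" "$2"
}

# Tiny made-up example to show the behaviour.
tmp="$(mktemp -d)"
printf 'fam1#Unknown\tOR5\t85.0\t300\t1200\t320\n' >  "$tmp/hits.tsv"
printf 'fam2#DNA/hAT\tOR5\t30.0\t50\t2500\t320\n'  >> "$tmp/hits.tsv"
printf '>fam1#Unknown\nACGTACGT\n>fam2#DNA/hAT\nTTGACA\n' > "$tmp/library.fa"

host_gene_hits "$tmp/hits.tsv" 40 0.5 > "$tmp/host_ids.txt"
filtered="$(drop_records "$tmp/host_ids.txt" "$tmp/library.fa")"
echo "$filtered"
```

In this toy input only the fam1 record passes both thresholds and is removed, so the fam2 record is what remains. Writing the flagged IDs to a file first leaves room for a manual check before anything is dropped from the library.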

TobyBaril commented 5 months ago

Closing for the moment, feel free to reopen if required