Use of repeats in classification

swarbred commented 4 years ago

Hi @cschu

I'm not sure using repeats to define the TE related genes is implemented in GMC. I'm running your test set with your repeat file added but my run with real data did not classify TE genes and looking back at earlier runs I don't think I've tested this.

generate_metrics.py

("repeat", {"metrics": ["nF1", "jF1", "eF1", "aF1"], "parser": None})

would seem to indicate you are not parsing this and I don't think we generate nF1 etc as this is not via mikado compare. @gemygk can confirm the process. To date we have only used the repeats in the classification stage to define biotype transposable_element_gene not for the fragmentary checks or scoring (currently gmc creates a repeat section in the scoring file). I would add the repeat metric to the metrics matrix but not add it to the mikado scoring config. That way the metric exists i.e. you could manually add this to the config (e.g. to use in the requirements to exclude TE genes) but it's not being used as default (i've been setting the multiplier as 0 and not_fragmentary_min_value as 1, which effectively turns this off, as it doesn't make sense to use this to score models).

The metric we generate is a single breadth of coverage of each model by all (i.e. overlapping) repeats. I think Gemy is just using bedtools for this. This is then used to define transposable_element_gene and there is a simple coverage cutoff for this (which as user you want to control via the gmc config).

David

swarbred commented 4 years ago

looks like you this is defined in the gmc_config.yaml

repeat_associated: "{te_score} >= 0.4"

but the scoring file has nF1", "jF1", "eF1", "aF1 metrics for the repeat metric but mikado compare doesn't seem to be run against the repeats (and this wouldn't give the correct coverage)

cschu commented 4 years ago

Hi @swarbred,

as far as I remember, the TE process was not outlined in the "specs" doc, hence the sparse and rudimentary implementation. I will wait for @gemygk 's input on this.

Cheers, Christian

swarbred commented 4 years ago

@cschu @gemygk The gff of repeats for us would be the repeatmasker gff output e.g. /hpc-home/swarbred/Project/CB-GENANNO-457_Annotation_Uromyces_beticola/Analysis/repeatmodeller-1.0.10/all_interspersed_repeats.genome.fa.out.gff

we should allow some flexibility in regard to the feature type, it should be "match_part" level, i.e. if a file had feature types of match we would ignore these and calculate the breadth of coverage of each model against all match part features. I would suggest just extracting features of match_part, similarity and exon.

gemygk commented 4 years ago

Hi @cschu

As discussed, the repeats information is in the google doc under Repeats section of the google doc - ei-annotation - gene model consolidation pipeline, Pages 11-12. It's a long doc, sorry

cschu commented 4 years ago

Cheers, nothing to be sorry about from your side - I either didn't see it or thought it would not be used because we didn't use tPSIpred.

cschu commented 4 years ago

I assume there will only ever be one set of repeat metrics data? I.e., 1 all and 1 interspersed?

swarbred commented 4 years ago

There would just be a single input file provided (just interspersed repeats) and just the single metric derived from this.

cschu commented 4 years ago

Hi @swarbred, @gemygk, I have implemented this now in gmc-0.8. If you could have a go and let me know how it works out that would be great.

swarbred commented 4 years ago

Thanks @cschu will get something running later today

cschu commented 4 years ago

I'll try to be on standby. It was a bit of a pain, so would be good to see if it works correctly.

swarbred commented 4 years ago

@cschu

So running 0.8 i'm getting an error, indicates the repeat gff is missing but the path given is correct, is this just indicating an issue with the file? To clarify gmc_metrics_repeats_extract_exons is extracting match_part, similarity and exon features?

localrules directive specifies rules that are not present in the Snakefile:
        gmc_metrics_parse_repeat_coveragegmc_metrics_blastp_combine

Building DAG of jobs...
Full Traceback (most recent call last):
  File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/__init__.py", line 653, in snakemake
    keepincomplete=keep_incomplete,
  File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/workflow.py", line 610, in execute
    dag.init()
  File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/dag.py", line 168, in init
    job = self.update([job], progress=progress)
  File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/dag.py", line 733, in update
    raise exceptions[0]
  File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/dag.py", line 698, in update
    progress=progress,
  File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/dag.py", line 784, in update_
    raise ex
  File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/dag.py", line 774, in update_
    progress=progress,
  File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/dag.py", line 733, in update
    raise exceptions[0]
  File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/dag.py", line 698, in update
    progress=progress,
  File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/dag.py", line 805, in update_
    raise MissingInputException(job.rule, missing_input)
snakemake.exceptions.MissingInputException: Missing input files for rule gmc_metrics_repeats_extract_exons:
/hpc-home/swarbred/Project/CB-GENANNO-457_Annotation_Uromyces_beticola/Analysis/gmc-0.7/mikado-2.0prc2_CBG/all_interspersed_repeats.genome.fa.out.gff

MissingInputException in line 215 of /ei/software/testing/gmc/0.8/x86_64/lib/python3.6/site-packages/gmc/zzz/gmc_run.smk.py:
Missing input files for rule gmc_metrics_repeats_extract_exons:
/hpc-home/swarbred/Project/CB-GENANNO-457_Annotation_Uromyces_beticola/Analysis/gmc-0.7/mikado-2.0prc2_CBG/all_interspersed_repeats.genome.fa.out.gff
unlocking
removed all locks

cschu commented 4 years ago

Hi @swarbred ,

I cannot access the locations where that was run. Concerning the repeat parser checks, I ported whatever @gemygk is doing in his script - but I guess that can be adapted.

As long as the file is present, it should not trigger the error you're reporting.

swarbred commented 4 years ago

From a quick look at the commands he is expecting the file formated as we generate for apollo loading i.e. match_part (perhaps also specific formatting in col9), I'm pointing at a file formated like this now and it's running (the error above might have been due to how I was linking the file).

The file he is using as input is a more valid GFF3 file, which is no bad thing, it means the raw repeatmasker file has to be reformatted, I might have been more generic given bedtools just cares about the coordinates as far as i'm aware (though I might be wrong). We should specify the required formatting in the documentation.

cschu commented 4 years ago

As I said, this ("MissingInputException in line 215 ...") is a snakemake error. At the point where the error message is raised, it doesn't matter how the file is formatted, it just doesn't find it. I wonder whether this is a permission issue? Is the whole run data located in that Project folder or just the repeat gff? Maybe the node gmc is running on doesn't have access to that location.

swarbred commented 4 years ago

Yes it was my sym linking that caused the issue but it looks like the file I was using would have failed later based on the commands that are being run.

cschu commented 4 years ago

Yes it was my sym linking that caused the issue but it looks like the file I was using would have failed later based on the commands that are being run.

But would you expect that it should work with that file? If so, could you share it, please?

swarbred commented 4 years ago

[swarbred@EI-HPC 2.0rc6_d094f99_CBG]$ head ../../repeatmodeller-1.0.10/all_repeats.genome.fa.out.gff
Uromyces_beticola_EI_v1.1_contig_000001 RepeatMasker    similarity  2250    2310    24.4    +   .   Target "Motif:GA-rich" 1 61
Uromyces_beticola_EI_v1.1_contig_000001 RepeatMasker    similarity  4763    4784     4.7    +   .   Target "Motif:(TCAC)n" 1 23
Uromyces_beticola_EI_v1.1_contig_000001 RepeatMasker    similarity  10042   10075   12.8    +   .   Target "Motif:(AATG)n" 1 35
Uromyces_beticola_EI_v1.1_contig_000001 RepeatMasker    similarity  25951   25976   13.1    +   .   Target "Motif:(TTC)n" 1 26
Uromyces_beticola_EI_v1.1_contig_000001 RepeatMasker    similarity  27257   27284   11.7    +   .   Target "Motif:(T)n" 1 28
Uromyces_beticola_EI_v1.1_contig_000001 RepeatMasker    similarity  58893   58927   14.1    +   .   Target "Motif:A-rich" 1 33
Uromyces_beticola_EI_v1.1_contig_000001 RepeatMasker    similarity  87364   87393   10.7    +   .   Target "Motif:GA-rich" 1 32
Uromyces_beticola_EI_v1.1_contig_000001 RepeatMasker    similarity  100446  100479  10.7    +   .   Target "Motif:(TTTTTC)n" 1 33
Uromyces_beticola_EI_v1.1_contig_000001 RepeatMasker    similarity  110527  110571  14.5    +   .   Target "Motif:(CTTTCTG)n" 1 47
Uromyces_beticola_EI_v1.1_contig_000001 RepeatMasker    similarity  112013  112057  15.7    +   .   Target "Motif:(TTTATT)n" 1 44
[swarbred@EI-HPC 2.0rc6_d094f99_CBG]$ head ../../repeatmodeller-1.0.10/all_interspersed_repeats.gff
##gff-version 3
Uromyces_beticola_EI_v1.1_contig_000001 all_interspersed_repeats    match   1307    1407    29  -   .   ID=all_interspersed_repeats:RM1;Name=Motif:REP-8_PSt
Uromyces_beticola_EI_v1.1_contig_000001 all_interspersed_repeats    match_part  1307    1407    29  -   .   ID=all_interspersed_repeats:RM1-exon1;Parent=all_interspersed_repeats:RM1
###
Uromyces_beticola_EI_v1.1_contig_000001 all_interspersed_repeats    match   4954    5068    26  +   .   ID=all_interspersed_repeats:RM2;Name=Motif:Gypsy-28_PGr-I
Uromyces_beticola_EI_v1.1_contig_000001 all_interspersed_repeats    match_part  4954    5068    26  +   .   ID=all_interspersed_repeats:RM2-exon1;Parent=all_interspersed_repeats:RM2
###
Uromyces_beticola_EI_v1.1_contig_000001 all_interspersed_repeats    match   5250    5363    22  +   .   ID=all_interspersed_repeats:RM3;Name=Motif:Gypsy-28_PGr-I
Uromyces_beticola_EI_v1.1_contig_000001 all_interspersed_repeats    match_part  5250    5363    22  +   .   ID=all_interspersed_repeats:RM3-exon1;Parent=all_interspersed_repeats:RM3
###

First file is the repeatmasker output, second is what Gemys script is expecting.

The current commands are more specific that I thought they would be

awk '$3 == "match_part"' /ei/projects/e145cad7-983f-4cae-b7e4-946300bcee57/data/results/CB-GENANNO-457_Annotation_Uromyces_beticola/Analysis/gmc-0.8/2.0rc6_d094f99_CBG/all_interspersed_repeats.genome.fa.out.gff | sed -e 's/\tmatch_part\t/\texon\t/' -e 's/\t[+-]\t/\t.\t/'

in that the first thing we do is extract match_part features and convert these to exon lines (plus set the strand to .), so obviously you need to provide an appropriate file, my comment above from a couple of days ago was about making this more generic i.e. either match_part + exon + similarity so that the repeatmasker output could be used directly.

At the moment if as a user you provided as input a file like the file we end up creating from the above command i.e. with exon lines it would fail, it seems a bit odd to require match_part then transform this to exon (these are not really exons as well) but not allow the user to provide exons in the input file.

the 9th col is also I think not valid for the script as I think he cuts based on ; later (sorry can't check now)

Again gemy can correct but I dont think bedtools cares what the feature is we are just calculating coverage of gene models against any features in this repeat file, so as long as we have features that are the level of "match_part or exon or similarity" then can potentially be more generic than the current script requires.

It's not a biggy as long as I (the user) knows what format is expected.

cschu commented 4 years ago

I'm happy to make adjustments so that it can process the original repeatmasker output as well. Just let me know what you'd like to have/support.

cschu commented 4 years ago

So, with @gemygk 's help, gmc now takes in the actual repeatmasker output. This can be tested in 0.8.1 later today (currently doing a test run).

cschu commented 4 years ago

0.8.1 can be tested

swarbred commented 4 years ago

@cschu So my 0.8 run completed, all the repeat steps completed fine but we do have an issue to resolve.

No models were classified as TE related

the following would be one example we should expect

http://apollo.tgac.ac.uk/CB-GENANNO-457_Uromyces_beticola_EI_v1_1_genome_browser/jbrowse/?loc=Uromyces_beticola_EI_v1.1_contig_000407%3A5489..17588&tracks=DNA%2CAnnotations%2Cmikado-2.0prc2_PickRun3%2Call_interspersed_repeats&highlight=

Looking through the metrics_matrix.txt, we have repeat coverage calculated (good)

cut -f1,179 metrics_matrix.txt |head
tid repeats_cov
UROBE1963355_run1_wRNA_UROBE1963355_r1_0000010.1    91.1800
UROBE1963355_run1_wRNA_UROBE1963355_r1_0000020.1    59.8000
UROBE1963355_run1_wRNA_UROBE1963355_r1_0000030.1    0.0000
UROBE1963355_run1_wRNA_UROBE1963355_r1_0000040.1    38.9400
UROBE1963355_run1_wRNA_UROBE1963355_r1_0000050.1    95.1900
UROBE1963355_run1_wRNA_UROBE1963355_r1_0000060.1    0.0000
UROBE1963355_run1_wRNA_UROBE1963355_r1_0000070.1    100.0000
UROBE1963355_run1_wRNA_UROBE1963355_r1_0000080.1    74.0200
UROBE1963355_run1_wRNA_UROBE1963355_r1_0000090.1    0.0000

The model shown in the above apollo example has the correct coverage

tid repeats_cov
mikado.loci.ptn_coding_mikado.Uromyces_beticola_EI_v1.1_contig_000407G1.1   88.6000

If you lookup the gene model in the final annotation corresponding to this model the te_score is 0.0

grep "id\|mikado.Uromyces_beticola_EI_v1.1_contig_000407G1.1" ../mikado.annotation.collapsed_metrics.tsv |cut -f1,9
#transcript te_score
mikado.Uromyces_beticola_EI_v1.1_contig_000407G1.1  0.0

All models have a 0.0 te_score

cat ../mikado.annotation.collapsed_metrics.tsv |cut -f9 |sort -u
0.0
te_score

So we are calculating the repeat coverage correctly but when making the final te assessment not populating the te_score like we are with the other metrics in mikado.annotation.collapsed_metrics.tsv hence why we then dont classify anything as te related

cschu commented 4 years ago

Yea, that was still populating with a 0.0 from initial development. This is now activated in 0.8.2.

swarbred commented 4 years ago

I reran the 0.8 run from the pick stage with 0.8.2 and te are classified as expected

cschu commented 4 years ago

Phew.

swarbred commented 4 years ago

@cschu

Minor change needed to the te_score in mikado.annotation.collapsed_metrics.tsv and the metrics_matrix.txt file

the te_score is a percentage in mikado.annotation.collapsed_metrics.tsv and metrics_matrix.txt but the gmc_run.run_config.yaml when referring to '{te_score} >= 0.4' is a decimal i.e. the intention is to filter at 40%. So something with a 0.84% repeat overlap is being called a TE gene

We should change this to a decimal in mikado.annotation.collapsed_metrics.tsv and metrics_matrix.txt as mikado can use 0-1 scores as raw values while the percentages need to be scaled.

cschu commented 4 years ago

@swarbred This is implemented in 1.0.2, which should be ready for testing after yesterday's/last night's testing had some technical issues.

swarbred commented 4 years ago

I think all issues have been resolved in the latest gmc version

cschu commented 4 years ago

we'll see what 1.2.4 will do with that...

Closing for now.

gemygk commented 3 years ago

cc @swarbred We need to make input repeats GFF file format more generic than being specific to RepeatMasker. When using other tools like RED, which does not generate a standard GFF3 or a BED output, in which case, we will be creating a GFF3 file in match->match_part format which could be used by Minos.

swarbred commented 3 years ago

Yep I'm a little confused as this was originally working with match match_part and changed to accommodate the repeatmasker input but I assumed this was in addition to and not to replace.

gemygk commented 1 year ago

Fixed in PR #62

EI-CoreBioinformatics / minos

Use of repeats in classification #18