Closed swarbred closed 1 year ago
Looks like this is defined in the gmc_config.yaml:
repeat_associated: "{te_score} >= 0.4"
but the scoring file has "nF1", "jF1", "eF1", "aF1" metrics for the repeat metric, whereas mikado compare doesn't seem to be run against the repeats (and this wouldn't give the correct coverage).
Hi @swarbred,
as far as I remember, the TE process was not outlined in the "specs" doc, hence the sparse and rudimentary implementation. I will wait for @gemygk 's input on this.
Cheers, Christian
@cschu @gemygk The gff of repeats for us would be the repeatmasker gff output e.g. /hpc-home/swarbred/Project/CB-GENANNO-457_Annotation_Uromyces_beticola/Analysis/repeatmodeller-1.0.10/all_interspersed_repeats.genome.fa.out.gff
We should allow some flexibility in regard to the feature type. It should be at the "match_part" level, i.e. if a file had features of type match we would ignore these and calculate the breadth of coverage of each model against all match_part features. I would suggest just extracting features of type match_part, similarity and exon.
Hi @cschu
As discussed, the repeats information is in the Google doc "ei-annotation - gene model consolidation pipeline", under the Repeats section, pages 11-12. It's a long doc, sorry.
Cheers, nothing to be sorry about from your side - I either didn't see it or thought it would not be used because we didn't use tPSIpred.
I assume there will only ever be one set of repeat metrics data? I.e., 1 all and 1 interspersed?
There would just be a single input file provided (just interspersed repeats) and just the single metric derived from this.
Hi @swarbred, @gemygk, I have implemented this now in gmc-0.8. If you could have a go and let me know how it works out that would be great.
Thanks @cschu will get something running later today
I'll try to be on standby. It was a bit of a pain, so would be good to see if it works correctly.
@cschu
So running 0.8 I'm getting an error. It indicates the repeat gff is missing, but the path given is correct; is this just indicating an issue with the file? To clarify, is gmc_metrics_repeats_extract_exons extracting match_part, similarity and exon features?
localrules directive specifies rules that are not present in the Snakefile:
gmc_metrics_parse_repeat_coveragegmc_metrics_blastp_combine
Building DAG of jobs...
Full Traceback (most recent call last):
File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/__init__.py", line 653, in snakemake
keepincomplete=keep_incomplete,
File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/workflow.py", line 610, in execute
dag.init()
File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/dag.py", line 168, in init
job = self.update([job], progress=progress)
File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/dag.py", line 733, in update
raise exceptions[0]
File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/dag.py", line 698, in update
progress=progress,
File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/dag.py", line 784, in update_
raise ex
File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/dag.py", line 774, in update_
progress=progress,
File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/dag.py", line 733, in update
raise exceptions[0]
File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/dag.py", line 698, in update
progress=progress,
File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/dag.py", line 805, in update_
raise MissingInputException(job.rule, missing_input)
snakemake.exceptions.MissingInputException: Missing input files for rule gmc_metrics_repeats_extract_exons:
/hpc-home/swarbred/Project/CB-GENANNO-457_Annotation_Uromyces_beticola/Analysis/gmc-0.7/mikado-2.0prc2_CBG/all_interspersed_repeats.genome.fa.out.gff
MissingInputException in line 215 of /ei/software/testing/gmc/0.8/x86_64/lib/python3.6/site-packages/gmc/zzz/gmc_run.smk.py:
Missing input files for rule gmc_metrics_repeats_extract_exons:
/hpc-home/swarbred/Project/CB-GENANNO-457_Annotation_Uromyces_beticola/Analysis/gmc-0.7/mikado-2.0prc2_CBG/all_interspersed_repeats.genome.fa.out.gff
unlocking
removed all locks
Hi @swarbred ,
I cannot access the locations where that was run. Concerning the repeat parser checks, I ported whatever @gemygk is doing in his script - but I guess that can be adapted.
As long as the file is present, it should not trigger the error you're reporting.
From a quick look at the commands, he is expecting the file formatted as we generate it for Apollo loading, i.e. match_part (perhaps also specific formatting in col 9). I'm pointing at a file formatted like this now and it's running (the error above might have been due to how I was linking the file).
The file he is using as input is a more valid GFF3 file, which is no bad thing, but it means the raw repeatmasker file has to be reformatted. I might have been more generic, given bedtools just cares about the coordinates as far as I'm aware (though I might be wrong). We should specify the required formatting in the documentation.
As I said, this ("MissingInputException in line 215 ...") is a snakemake error. At the point where the error message is raised, it doesn't matter how the file is formatted, it just doesn't find it. I wonder whether this is a permission issue? Is the whole run data located in that Project folder or just the repeat gff? Maybe the node gmc is running on doesn't have access to that location.
Yes it was my sym linking that caused the issue but it looks like the file I was using would have failed later based on the commands that are being run.
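For anyone hitting the same MissingInputException with a path that looks correct: a minimal sketch of how a dangling symlink can be told apart from a genuinely missing file. The function name `describe_input` is hypothetical and not part of gmc or snakemake; it only illustrates that an existence check follows the link, so a link whose target is absent (or not visible from the node running the job) looks missing.

```python
import os


def describe_input(path):
    """Classify a path roughly the way an input-existence check would see it.

    Hypothetical helper for illustration only; not part of gmc or snakemake.
    """
    if os.path.exists(path):   # follows symlinks, so the target must exist
        return "ok"
    if os.path.islink(path):   # the link itself is there, its target is not
        return "broken symlink -> " + os.readlink(path)
    return "missing"
```

Running this on the linked repeat GFF before starting the workflow would show whether the link or the file itself is the problem.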
But would you expect that it should work with that file? If so, could you share it, please?
[swarbred@EI-HPC 2.0rc6_d094f99_CBG]$ head ../../repeatmodeller-1.0.10/all_repeats.genome.fa.out.gff
Uromyces_beticola_EI_v1.1_contig_000001 RepeatMasker similarity 2250 2310 24.4 + . Target "Motif:GA-rich" 1 61
Uromyces_beticola_EI_v1.1_contig_000001 RepeatMasker similarity 4763 4784 4.7 + . Target "Motif:(TCAC)n" 1 23
Uromyces_beticola_EI_v1.1_contig_000001 RepeatMasker similarity 10042 10075 12.8 + . Target "Motif:(AATG)n" 1 35
Uromyces_beticola_EI_v1.1_contig_000001 RepeatMasker similarity 25951 25976 13.1 + . Target "Motif:(TTC)n" 1 26
Uromyces_beticola_EI_v1.1_contig_000001 RepeatMasker similarity 27257 27284 11.7 + . Target "Motif:(T)n" 1 28
Uromyces_beticola_EI_v1.1_contig_000001 RepeatMasker similarity 58893 58927 14.1 + . Target "Motif:A-rich" 1 33
Uromyces_beticola_EI_v1.1_contig_000001 RepeatMasker similarity 87364 87393 10.7 + . Target "Motif:GA-rich" 1 32
Uromyces_beticola_EI_v1.1_contig_000001 RepeatMasker similarity 100446 100479 10.7 + . Target "Motif:(TTTTTC)n" 1 33
Uromyces_beticola_EI_v1.1_contig_000001 RepeatMasker similarity 110527 110571 14.5 + . Target "Motif:(CTTTCTG)n" 1 47
Uromyces_beticola_EI_v1.1_contig_000001 RepeatMasker similarity 112013 112057 15.7 + . Target "Motif:(TTTATT)n" 1 44
[swarbred@EI-HPC 2.0rc6_d094f99_CBG]$ head ../../repeatmodeller-1.0.10/all_interspersed_repeats.gff
##gff-version 3
Uromyces_beticola_EI_v1.1_contig_000001 all_interspersed_repeats match 1307 1407 29 - . ID=all_interspersed_repeats:RM1;Name=Motif:REP-8_PSt
Uromyces_beticola_EI_v1.1_contig_000001 all_interspersed_repeats match_part 1307 1407 29 - . ID=all_interspersed_repeats:RM1-exon1;Parent=all_interspersed_repeats:RM1
###
Uromyces_beticola_EI_v1.1_contig_000001 all_interspersed_repeats match 4954 5068 26 + . ID=all_interspersed_repeats:RM2;Name=Motif:Gypsy-28_PGr-I
Uromyces_beticola_EI_v1.1_contig_000001 all_interspersed_repeats match_part 4954 5068 26 + . ID=all_interspersed_repeats:RM2-exon1;Parent=all_interspersed_repeats:RM2
###
Uromyces_beticola_EI_v1.1_contig_000001 all_interspersed_repeats match 5250 5363 22 + . ID=all_interspersed_repeats:RM3;Name=Motif:Gypsy-28_PGr-I
Uromyces_beticola_EI_v1.1_contig_000001 all_interspersed_repeats match_part 5250 5363 22 + . ID=all_interspersed_repeats:RM3-exon1;Parent=all_interspersed_repeats:RM3
###
First file is the repeatmasker output, second is what Gemys script is expecting.
The current commands are more specific than I thought they would be:
awk '$3 == "match_part"' /ei/projects/e145cad7-983f-4cae-b7e4-946300bcee57/data/results/CB-GENANNO-457_Annotation_Uromyces_beticola/Analysis/gmc-0.8/2.0rc6_d094f99_CBG/all_interspersed_repeats.genome.fa.out.gff | sed -e 's/\tmatch_part\t/\texon\t/' -e 's/\t[+-]\t/\t.\t/'
In that the first thing we do is extract match_part features and convert these to exon lines (plus set the strand to .), so obviously you need to provide an appropriate file. My comment above from a couple of days ago was about making this more generic, i.e. accepting any of match_part, exon or similarity, so that the repeatmasker output could be used directly.
At the moment, if as a user you provided a file like the one we end up creating from the above command, i.e. with exon lines, it would fail. It seems a bit odd to require match_part and then transform this to exon (these are not really exons either) but not allow the user to provide exons in the input file.
The 9th col is also, I think, not valid for the script, as I think he cuts based on ; later (sorry, can't check now).
Again, Gemy can correct me, but I don't think bedtools cares what the feature is; we are just calculating coverage of gene models against any features in this repeat file. So as long as we have features at the level of match_part, exon or similarity, we can potentially be more generic than the current script requires.
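To make the suggestion concrete, here is a rough Python sketch of the more generic extraction: it mirrors the awk/sed command above (rename the feature to exon, set the strand to .) but accepts match_part, similarity and exon. The names `ACCEPTED_TYPES` and `normalise_repeat_features` are made up for illustration; this is not the code gmc runs.

```python
# Feature types treated as "match_part level"; an assumption taken from the
# suggestion in this thread, not from the gmc source.
ACCEPTED_TYPES = {"match_part", "similarity", "exon"}


def normalise_repeat_features(lines, accepted=ACCEPTED_TYPES):
    """Yield unstranded 'exon' lines for every accepted feature in a GFF stream."""
    for line in lines:
        # Skip blank lines, directives (##gff-version, ###) and comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Ignore malformed lines and parent-level features such as 'match'.
        if len(cols) < 9 or cols[2] not in accepted:
            continue
        cols[2] = "exon"  # same renaming as the sed expression
        cols[6] = "."     # drop the strand, as in the original command
        yield "\t".join(cols)
```

Because column 9 is passed through untouched, any later step that splits the attributes on ; would still need to cope with the raw RepeatMasker `Target "..."` style.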
It's not a biggy as long as I (the user) knows what format is expected.
I'm happy to make adjustments so that it can process the original repeatmasker output as well. Just let me know what you'd like to have/support.
So, with @gemygk 's help, gmc now takes in the actual repeatmasker output. This can be tested in 0.8.1 later today (currently doing a test run).
0.8.1 can be tested
@cschu So my 0.8 run completed, all the repeat steps completed fine but we do have an issue to resolve.
No models were classified as TE related
the following would be one example we should expect
Looking through the metrics_matrix.txt, we have repeat coverage calculated (good)
cut -f1,179 metrics_matrix.txt |head
tid repeats_cov
UROBE1963355_run1_wRNA_UROBE1963355_r1_0000010.1 91.1800
UROBE1963355_run1_wRNA_UROBE1963355_r1_0000020.1 59.8000
UROBE1963355_run1_wRNA_UROBE1963355_r1_0000030.1 0.0000
UROBE1963355_run1_wRNA_UROBE1963355_r1_0000040.1 38.9400
UROBE1963355_run1_wRNA_UROBE1963355_r1_0000050.1 95.1900
UROBE1963355_run1_wRNA_UROBE1963355_r1_0000060.1 0.0000
UROBE1963355_run1_wRNA_UROBE1963355_r1_0000070.1 100.0000
UROBE1963355_run1_wRNA_UROBE1963355_r1_0000080.1 74.0200
UROBE1963355_run1_wRNA_UROBE1963355_r1_0000090.1 0.0000
The model shown in the above apollo example has the correct coverage
tid repeats_cov
mikado.loci.ptn_coding_mikado.Uromyces_beticola_EI_v1.1_contig_000407G1.1 88.6000
If you lookup the gene model in the final annotation corresponding to this model the te_score is 0.0
grep "id\|mikado.Uromyces_beticola_EI_v1.1_contig_000407G1.1" ../mikado.annotation.collapsed_metrics.tsv |cut -f1,9
#transcript te_score
mikado.Uromyces_beticola_EI_v1.1_contig_000407G1.1 0.0
All models have a 0.0 te_score
cat ../mikado.annotation.collapsed_metrics.tsv |cut -f9 |sort -u
0.0
te_score
So we are calculating the repeat coverage correctly, but when making the final TE assessment we are not populating the te_score the way we do for the other metrics in mikado.annotation.collapsed_metrics.tsv, which is why nothing is then classified as TE related.
Yea, that was still populating with a 0.0 from initial development. This is now activated in 0.8.2.
I reran the 0.8 run from the pick stage with 0.8.2 and te are classified as expected
Phew.
@cschu
Minor change needed to the te_score in mikado.annotation.collapsed_metrics.tsv and the metrics_matrix.txt file
The te_score is a percentage in mikado.annotation.collapsed_metrics.tsv and metrics_matrix.txt, but the '{te_score} >= 0.4' expression in gmc_run.run_config.yaml treats it as a decimal, i.e. the intention is to filter at 40%. So something with a 0.84% repeat overlap is being called a TE gene.
We should change this to a decimal in both files, as mikado can use 0-1 scores as raw values while the percentages need to be scaled.
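A small sketch of the mismatch, assuming the cutoff is applied as a plain comparison against the stored value. The function names are hypothetical; only the 0.4 cutoff and the example values come from this thread.

```python
def to_fraction(repeats_cov_percent):
    """Convert a percentage such as 88.6 into a 0-1 score such as 0.886."""
    return round(repeats_cov_percent / 100.0, 4)


def is_repeat_associated(te_score, cutoff=0.4):
    """Apply the '{te_score} >= 0.4' rule to an already scaled score."""
    return te_score >= cutoff


# Comparing the raw percentage reproduces the problem described above:
# 0.84 (%) passes a cutoff that was meant to be 40%.
wrongly_called = is_repeat_associated(0.84)
correctly_rejected = is_repeat_associated(to_fraction(0.84))
```

With the value scaled first, the 88.6% model from the Apollo example still passes the cutoff, while the 0.84% case no longer does.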
@swarbred This is implemented in 1.0.2, which should be ready for testing after yesterday's/last night's testing had some technical issues.
I think all issues have been resolved in the latest gmc version
we'll see what 1.2.4 will do with that...
Closing for now.
cc @swarbred
We need to make input repeats GFF file format more generic than being specific to RepeatMasker.
Other tools like RED do not generate standard GFF3 or BED output; in that case we will create a GFF3 file in match->match_part format, which could be used by Minos.
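As a rough illustration of that conversion, the sketch below turns plain repeat intervals into match/match_part pairs laid out like the all_interspersed_repeats.gff example earlier in this thread. The function name, the `source` default, and the use of . for score and strand are assumptions; how RED output is actually parsed into intervals is not covered here.

```python
def intervals_to_gff3(intervals, source="repeats"):
    """Render (seqid, start, end) tuples (1-based, inclusive) as match/match_part GFF3.

    Hypothetical helper: the ID pattern imitates the example file in this
    thread; score, strand and phase are written as '.' because a plain
    interval list carries no such information.
    """
    lines = ["##gff-version 3"]
    for i, (seqid, start, end) in enumerate(intervals, 1):
        parent = f"{source}:RM{i}"
        coords = [str(start), str(end), ".", ".", "."]
        lines.append("\t".join([seqid, source, "match", *coords, f"ID={parent}"]))
        lines.append("\t".join(
            [seqid, source, "match_part", *coords,
             f"ID={parent}-exon1;Parent={parent}"]
        ))
        lines.append("###")
    return "\n".join(lines) + "\n"
```

Only the match_part lines matter for the coverage calculation; the match parents are kept so the file stays a valid parent/child GFF3.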
Yep, I'm a little confused, as this was originally working with match/match_part and was changed to accommodate the repeatmasker input, but I assumed this was in addition to, not a replacement for, the original format.
Fixed in PR #62
Hi @cschu
I'm not sure using repeats to define the TE related genes is implemented in GMC. I'm running your test set with your repeat file added but my run with real data did not classify TE genes and looking back at earlier runs I don't think I've tested this.
generate_metrics.py would seem to indicate you are not parsing this, and I don't think we generate nF1 etc. as this is not via mikado compare. @gemygk can confirm the process. To date we have only used the repeats in the classification stage to define biotype transposable_element_gene, not for the fragmentary checks or scoring (currently gmc creates a repeat section in the scoring file). I would add the repeat metric to the metrics matrix but not add it to the mikado scoring config. That way the metric exists, i.e. you could manually add it to the config (e.g. to use in the requirements to exclude TE genes), but it's not used by default (I've been setting the multiplier to 0 and not_fragmentary_min_value to 1, which effectively turns this off, as it doesn't make sense to use this to score models).
The metric we generate is a single breadth of coverage of each model by all (i.e. overlapping) repeats. I think Gemy is just using bedtools for this. This is then used to define transposable_element_gene and there is a simple coverage cutoff for this (which as user you want to control via the gmc config).
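For reference, a pure-Python sketch of what "breadth of coverage by all (overlapping) repeats" means. It assumes the model is represented by its exon intervals and that overlapping repeats are merged first so no base is counted twice. The real pipeline reportedly uses bedtools; this is not that command, just the same arithmetic spelled out.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def breadth_of_coverage(exons, repeats):
    """Fraction (0-1) of a model's exonic bases overlapped by at least one repeat."""
    exons = merge_intervals(exons)
    repeats = merge_intervals(repeats)
    total = sum(end - start + 1 for start, end in exons)
    covered = 0
    for e_start, e_end in exons:
        for r_start, r_end in repeats:
            lo, hi = max(e_start, r_start), min(e_end, r_end)
            if lo <= hi:  # the two intervals overlap
                covered += hi - lo + 1
    return covered / total if total else 0.0
```

The result is a 0-1 value, so it can be compared directly against a cutoff such as the 0.4 used for repeat_associated.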
David