francicco closed this issue 3 years ago
Hi @francicco
Can you be clear about what you mean by "nested"? Spliced models entirely within introns will not generally be removed by the default configuration. If you are referring to models with partial exon overlaps, then there are a few places in the config that can be adjusted, see:
[pick.clustering]
# Parameters related to the clustering of transcripts into loci.
# - cds_only: boolean, it specifies whether to cluster transcripts only according to their CDS (if present).
# - min_cds_overlap: minimal CDS overlap for the second clustering.
# - min_cdna_overlap: minimal cDNA overlap for the second clustering.
# - flank: maximum distance for transcripts to be clustered within the same superlocus.
# - remove_overlapping_fragments: boolean, it specifies whether to remove putative fragments.
# - purge: boolean, it specifies whether to remove transcripts which fail the minimum requirements check - or whether to ignore those requirements altogether.
# - simple_overlap_for_monoexonic: boolean. If set to true (default), then any overlap means inclusion in a locus for a monoexonic transcript. If set to false, the normal controls for the percentage of overlap will apply.
# - max_distance_for_fragments: maximum distance from a valid locus for another to be considered a fragment.
cds_only = false
min_cds_overlap = 0.2
min_cdna_overlap = 0.2
purge = true
flank = 200
simple_overlap_for_monoexonic = true
If the "missing" genes are monoexonic, switching simple_overlap_for_monoexonic = false will allow some monoexonic genes to be retained with a partial overlap.
You can also adjust min_cds_overlap = 0.2 and min_cdna_overlap = 0.2 to change how much overlap is required before clustering (for both mono- and multi-exonic genes).
Changing cds_only = false to cds_only = true will define clusters based on CDS structure overlap, i.e. to belong to the same gene, multi-exonic transcripts need to share a coding splice site. That will mean that genes which share only splice junctions in UTRs will be output as separate genes.
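Putting those suggestions together, an adjusted [pick.clustering] section might look like this (the values are illustrative examples, not recommendations):

```toml
[pick.clustering]
# Cluster on CDS overlap only, so transcripts sharing only UTR splice
# junctions fall into separate genes.
cds_only = true
# Raise the minimum overlaps so that nested models with only a small
# partial overlap are not pulled into the surrounding locus.
min_cds_overlap = 0.5
min_cdna_overlap = 0.5
purge = true
flank = 200
# Apply the normal percentage-overlap controls to monoexonic models too.
simple_overlap_for_monoexonic = false
```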
There may be other reasons why the transcripts you regard as missing are removed, i.e. they may not meet the requirements in the scoring file (see the requirements section of the scoring file), or they could even be removed during the prepare stage.
If your nested models share splice sites with another selected model, then these may be excluded due to the as_requirements (see the scoring file).
If these are monoexonic models, then they may be being filtered by the not_fragmentary expression (scoring file); you can either adjust this expression or change how fragments are handled in the pick.fragments section of the configuration:
[pick.fragments]
# Parameters related to the handling of fragments.
# - remove: boolean. Whether to remove fragments or leave them, properly tagged.
# - max_distance: maximum distance of a putative fragment from a valid gene.
# - valid_class_codes: which class codes will be considered as fragments. Default: (p, P, x, X, i, m, _). Choices: '_' plus any class code with category 'Intronic', 'Fragment', or 'Overlap'.
remove = true
max_distance = 2000
valid_class_codes = ["p", "P", "x", "X", "i", "m", "_", "e", "o"]
The class codes (https://mikado.readthedocs.io/en/latest/Usage/Compare.html#class-codes) define what can be regarded as a fragment; removing "i" would exclude intronic monoexonic transcripts from this definition.
If you can show a browser screenshot of the kinds of models that are being removed, that might help me to advise. But if you are losing monoexonic loci, then removing "i" from the fragments class codes and switching simple_overlap_for_monoexonic = false may achieve what you want.
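For instance, to stop monoexonic models that sit inside introns being classed as fragments, "i" can be dropped from the list (illustrative fragment):

```toml
[pick.fragments]
remove = true
max_distance = 2000
# "i" removed from the defaults, so monoexonic transcripts contained
# within an intron of another gene are no longer treated as fragments.
valid_class_codes = ["p", "P", "x", "X", "m", "_", "e", "o"]
```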
I'm also going to give a shameless plug for the workshop we are organising for April next year that may be of interest: https://www.earlham.ac.uk/genome-annotation-workshop-2021
Best David
Thank you, I'll try to play with it.
One last small question. During mikado serialize I get the log message:
2020-11-27 11:34:40,242 - serialiser - blast_serialiser.py:221 - INFO - __serialize_queries - MainProcess - Loaded 0 objects into the "query" table
Why 0 objects? I don't get it. Thanks F
I'm also getting these kinds of problems
On the top, the reference annotation, followed by portcullis' splice sites and the mikado pick output, and at the very bottom the different transcripts reconstructed by cufflinks.
Any tips? F
Hi @francicco , the message means that the necessary transcripts are already inside the database. It can be safely ignored.
Ok, good.
About the first question, I have made a script that removes those kinds of transcripts before giving them to mikado. It was faster.
F
This is a weird behaviour. Mikado should select against the transcript with such a high number of UTR exons, frankly, and I am also very surprised that Mikado does not break up the long model as I bet that it should have more than one ORF.
Could you please check the following for us?
Run mikado pick with the explicit flag --mode stringent, or even --mode permissive, and see whether this solves the issue in the locus.
As for the scoring:
five_utr_length:
  filter: {operator: le, value: 3500}
  rescaling: target
  value: 400
five_utr_num:
  filter: {operator: lt, value: 4}
  rescaling: target
  value: 2
three_utr_length:
  filter: {operator: le, value: 3500}
  rescaling: target
  value: 800
three_utr_num:
  filter: {operator: lt, value: 3}
  rescaling: target
  value: 1
You could increase the "value" here to reward more transcripts that have "well-formed" UTRs.
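For illustration, a version of the length metrics above with the "value" fields raised, as suggested (the numbers are arbitrary examples, not recommendations):

```yaml
five_utr_length:
  filter: {operator: le, value: 3500}
  rescaling: target
  value: 800    # raised from 400, per the suggestion above
three_utr_length:
  filter: {operator: le, value: 3500}
  rescaling: target
  value: 1200   # raised from 800
```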
You could also explicitly tell Mikado to remove any transcript with such a large number of UTR exons, using the requirements section of the scoring file; specifically, given this:
requirements:
  expression: [(combined_cds_fraction.ncrna or combined_cds_fraction.coding), and, ((exon_num.multi and (cdna_length.multi, or, combined_cds_length.multi) and max_intron_length, and, min_intron_length and proportion_verified_introns_inlocus ) or (exon_num.mono and (combined_cds_length.mono or cdna_length.mono))) ]
  parameters:
    combined_cds_fraction.ncrna: {operator: eq, value: 0}
    combined_cds_fraction.coding: {operator: gt, value: 0.20}
    cdna_length.mono: {operator: gt, value: 400}
    cdna_length.multi: {operator: ge, value: 300}
    combined_cds_length.mono: {operator: gt, value: 225}
    combined_cds_length.multi: {operator: gt, value: 150}
    exon_num.mono: {operator: eq, value: 1}
    exon_num.multi: {operator: gt, value: 1}
    max_intron_length: {operator: le, value: 20000}
    min_intron_length: {operator: ge, value: 5}
    proportion_verified_introns_inlocus: {operator: gt, value: 0}
You can modify it in the following way:
requirements:
  expression: [(combined_cds_fraction.ncrna or combined_cds_fraction.coding), and, ((exon_num.multi and (cdna_length.multi, or, combined_cds_length.multi) and max_intron_length, and, min_intron_length and proportion_verified_introns_inlocus and five_utr_num and three_utr_num) or (exon_num.mono and (combined_cds_length.mono or cdna_length.mono))) ]
  parameters:
    combined_cds_fraction.ncrna: {operator: eq, value: 0}
    combined_cds_fraction.coding: {operator: gt, value: 0.20}
    cdna_length.mono: {operator: gt, value: 400}
    cdna_length.multi: {operator: ge, value: 300}
    combined_cds_length.mono: {operator: gt, value: 225}
    combined_cds_length.multi: {operator: gt, value: 150}
    exon_num.mono: {operator: eq, value: 1}
    exon_num.multi: {operator: gt, value: 1}
    max_intron_length: {operator: le, value: 20000}
    min_intron_length: {operator: ge, value: 5}
    proportion_verified_introns_inlocus: {operator: gt, value: 0}
    five_utr_num: {operator: le, value: 6}
    three_utr_num: {operator: le, value: 4}
Here I added the extra requirement (both in the parameters and in the expression) that 5' UTRs must contain at most six exons and 3'UTRs at most four exons. Modifying the requirements section like this would immediately get rid of that long transcript (as it has way more than four 3' UTR exons).
I would, however, first of all check why that obviously chimeric transcript is not discarded. I really suspect that either Mikado is running in "nosplit" mode, the ORFs are not called properly, or they have not been scooped up correctly by mikado serialise. The latter possibility we can investigate together if the other two are excluded.
I hope this helps.
Luca Venturini
@francicco
Edit: I see Luca has just replied, but my comments are below.
The default scoring wouldn't lead to selecting the transcript shown in the pick if all the cufflinks models were available. You can obviously change the scoring in a way that would lead to selecting that transcript, but I wouldn't expect the default scoring to select this if CDS features are present, i.e. ORFs were called and loaded via serialise.
If pick is selecting that transcript, the other models shown in the cufflinks track were input and present in the prepare file, and you have called and loaded ORFs, then my guess would be that something is strange with the scoring. The way to check this is to output the subloci scores during pick, e.g. --subloci-out mikado.subloci.gff3. You can then identify the model whose selection you query in the mikado.loci.gff3 file and get its id from the alias tag, e.g. alias=LIB23436_stringtie_LIB23436_str.1.1
If you grep this from the mikado.subloci.scores.tsv
grep "LIB23436_stringtie_LIB23436_str.1.1" mikado.subloci.scores.tsv
you can find the sublocus to which it belongs e.g.
LIB23436_stringtie_LIB23436_str.1.1 LIB23436_stringtie_LIB23436_str.1.1 sublocus:Bombus_Vestalis_EIV0.2_CONTIG_0000001-:49696-50480.multi
then grep out all the models which are part of this sublocus:
grep "tid|sublocus:Bombus_Vestalis_EIV0.2_CONTIG_0000001-:49696-50480.multi" mikado.subloci.scores.tsv |sort -k4,4n |column -t -s $'\t' |less -S
You should have several models for this sublocus; the one selected (i.e. shown in the browser) should have the highest score, and you can then check the other metrics to see why this transcript scored the highest.
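The same check can be done programmatically. This is a minimal sketch, assuming (based on the grep/sort commands above) that the scores table is tab-separated with the transcript id in column 1, the sublocus id in column 3 and the score in column 4; the toy data and ids are made up:

```python
# Sketch: find the highest-scoring transcript within one sublocus of a
# mikado.subloci.scores.tsv-style table. Column layout is an assumption
# inferred from the grep/sort pipeline shown above.
import csv
from io import StringIO

def best_in_sublocus(tsv_text, sublocus_id):
    """Return (tid, score) of the highest-scoring model in a sublocus."""
    best = None
    for row in csv.reader(StringIO(tsv_text), delimiter="\t"):
        if len(row) < 4 or row[2] != sublocus_id:
            continue
        tid, score = row[0], float(row[3])
        if best is None or score > best[1]:
            best = (tid, score)
    return best

# Toy rows standing in for the real scores file.
toy = "\n".join([
    "tx1\ttx1\tsublocus:chr1:100-900.multi\t12.5",
    "tx2\ttx2\tsublocus:chr1:100-900.multi\t18.0",
    "tx3\ttx3\tsublocus:chr1:2000-3000.multi\t5.0",
])
print(best_in_sublocus(toy, "sublocus:chr1:100-900.multi"))  # ('tx2', 18.0)
```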
I would, however, first of all check why that obviously chimeric transcript is not discarded. I really suspect that either Mikado is running in "nosplit" mode, the ORFs are not called properly, or they have not been scooped up correctly by mikado serialise. The latter possibility we can investigate together if the other two are excluded.
This shouldn't be to do with the split mode; even with no splitting, there are other models that I would expect to be selected. My guess, if the normal scoring is being used, would be that the ORFs have not been called/loaded; having said that, the screenshot you show has a pick model with CDS, which would seem to rule this out (if it reflects the final pick output).
Hi guys,
Thanks a lot, I'm going to read carefully what you're suggesting and try the different pick modes, and see how it behaves compared with my script. The way I did this was using a combinatorial approach of blast and transdecoder: basically, I remove all transcripts that have multiple valid CDS regions (predicted by transdecoder with a valid blast hit) separated by a non-coding stretch of at least 100 nt. Not all fusions reported by Mikado are wrong; the majority are in fact correct, and I know that by comparing the query lengths with the hit lengths (SwissProt).
The only problem I'll have is how to then select the best method among the different results; any suggestions?
The tests will take days to finish, unfortunately
Cheers, F
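The filtering rule described above (flag a transcript as a putative chimera when it carries two or more BLAST-supported ORFs separated by a non-coding stretch of at least 100 nt) could be sketched roughly like this; the input format is hypothetical, and a real script would parse the TransDecoder/BLAST output to build it:

```python
# Rough sketch of the chimera filter described above. A transcript is
# flagged when two or more BLAST-supported CDS regions are separated by
# a non-coding gap of at least MIN_GAP nucleotides.
MIN_GAP = 100

def is_putative_chimera(cds_regions, min_gap=MIN_GAP):
    """cds_regions: list of (start, end) transcript coordinates of the
    valid CDS regions (i.e. TransDecoder ORFs with a BLAST hit)."""
    regions = sorted(cds_regions)
    for (s1, e1), (s2, e2) in zip(regions, regions[1:]):
        if s2 - e1 >= min_gap:  # non-coding gap between consecutive ORFs
            return True
    return False

print(is_putative_chimera([(1, 500), (700, 1200)]))  # True  (gap = 200 nt)
print(is_putative_chimera([(1, 500), (550, 1200)]))  # False (gap = 50 nt)
print(is_putative_chimera([(1, 900)]))               # False (single ORF)
```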
@francicco
The pick mode won't be the cause of this issue; even if you ran with no splitting, one of the correct non-chimeric transcripts would be selected. The example you attached should be resolved correctly if the inputs to mikado are correct (i.e. blast and CDS). If you run mikado pick with -r you can limit mikado to a region (i.e. just the region in your example); if you do that and also generate the --subloci-out mikado.subloci.gff3 and --monoloci-out mikado.monoloci.gff3 files, then attach the mikado.loci.gff3, mikado.monoloci.gff3 and mikado.subloci.gff3 files to this ticket, plus your config and scoring files.
Hi @francicco
Thanks a lot, I'm gonna read carefully what you're suggesting trying to use different pick mode. And see how it behave compared with my script. The way I did this was using a combinatorial approach of blast and transdecoder. Basically, I remove all transcripts that have multiple valid CDS regions (predicted by transdecoder with a valid blast hit), but, they have to be separated by a non-coding stretch of at least 100nt. Because not all fusions reported by Mikado are wrong, the majority are in fact correct. And I know that by comparing the query lengths with the hit lengths (SwissProt).
Could you please clarify what you mean by saying that the fusions reported by Mikado are wrong?
I know, if the input is fine the rest will be smooth; that's why I'm sorting out the chimeric transcripts before mikado prepare.
Because not all fusions reported by Mikado are wrong, the majority are in fact correct.
What I wanted to say is that most of mikado's selections are right.
So, a small update, everything seems to work perfectly now
In the plot you can see that all the transcripts have good alignments with SwissProt proteins. There's one transcriptome (red) that looks a bit more fragmented; I believe it is the quality of the RNA-seq, probably a bit degraded. In fact, I have an excess of monoexonic transcripts in this species. Is there anything that mikado can do to at least partially fix this problem?
Thanks a lot F
Hi @francicco,
Regarding the excess of monoexonic transcripts, Mikado should be able to help. Is the information here in the documentation useful at all?
https://mikado.readthedocs.io/en/latest/Tutorial/Adapting/#case-study-2-noisy-rna-seq-data
Thanks! I'll have a look! F
Closing for now. @francicco, please let us know whether you would need further assistance on this!
Hi,
I'm trying to find a way to tell mikado not to throw away nested genes. Looking at the mikado tutorial, I still haven't found anything. Which are the parameters I should change or introduce?
Thanks a lot F