EI-CoreBioinformatics / reat

Robust Eukaryotic Annotation Toolkit
https://reat.readthedocs.io/en/latest/
MIT License

call-LengthChecker - presence of extra base (stop codon) affects classification #21

Closed swarbred closed 2 years ago

swarbred commented 2 years ago

seqkit translate outputs the stop codon if present; this is different to gffread, which removes the final stop.

This has a knock-on effect when using classify_transcripts, with all models classified as silver (no gold or bronze).

The only difference between the two diamond files is the one-amino-acid-longer query protein in the post-seqkit diamond file (first example, 394 aa) compared to the second example with the protein from gffread (393 aa).

That additional amino acid on the query length causes all query models to be classified as silver.
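For illustration, a toy sketch of why that one extra residue matters (this is not the actual LengthChecker/classify_transcripts code): with the same alignment coordinates, query coverage is 100% against the 393 aa gffread protein but falls just short against the 394 aa seqkit protein.

```python
# Toy illustration only, not the actual classify_transcripts code.
# Same alignment (qstart=1, qend=393) against the two query lengths
# seen in the diamond files below.
def query_coverage(qstart, qend, qlen):
    return 100.0 * (qend - qstart + 1) / qlen

print(query_coverage(1, 393, 393))  # 100.0   (gffread protein, stop removed)
print(query_coverage(1, 393, 394))  # ~99.75  (seqkit protein, stop retained)
```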

cat head_test_diamond.hits.tsv
mikado.all.Calendula_officinalis_EIV1.2_Contig_0000001G1002.1 A0A103XHT5 394 393 80.4 393 77 0 1 393 1 393 1.30e-233 647 87.3 6EDDXEXHXHXQXQXLXLT2SF1QM3FS3DE5PS3SP2IV5MR14SL3IL3SP7NS3SN1SK2TA10KR6HP1HY5NK1IV12RL20NT8KRKR2QR2NY4GVKRPS13GT10LV3RGSA2ED1IK7MS7GD2QH29VT6EANK2FY11VD3TA2VS1GVEDWG1TC8RH4VL11KR6LV12FC1TS2RG1IV3LF5VLFY1SI1SNSG1FS7VE6YI3LM8FH2
mikado.all.Calendula_officinalis_EIV1.2_Contig_0000001G1002.1 A0A251U8X2 394 383 78.7 375 80 0 19 393 9 383 4.12e-216 603 86.4 2SQHKSTFS1TS1DN2QESYVA1TP2SQ8MR16RY1IL3SP2IV1ST2NT3SC1SK2TA16CSHS1HYDH3DNNH1IVND32NT8KRKN1NSQR2NY2FR1GAKL2WC8DQ2GT3KR2GE3LV2ND1SA2ES1IS1KR5MY6GDGA2QH20RK8VM6EANK2FY2VI8VD6VA3WS1TC7NV10GR25LFFC1TS2RG1IL3LF4VI1FY6FS5FV1VE4YC1YF10IN1FP2

(/ei/software/testing/reat/dev_prediction_module/x86_64) [swarbred@EI-HPC interactive Test]$ cat head_test_ori_diamond.hits.tsv
mikado.all.Calendula_officinalis_EIV1.2_Contig_0000001G1002.1 A0A103XHT5 393 393 80.4 393 77 0 1 393 1 393 1.25e-233 647 87.3 6EDDXEXHXHXQXQXLXLT2SF1QM3FS3DE5PS3SP2IV5MR14SL3IL3SP7NS3SN1SK2TA10KR6HP1HY5NK1IV12RL20NT8KRKR2QR2NY4GVKRPS13GT10LV3RGSA2ED1IK7MS7GD2QH29VT6EANK2FY11VD3TA2VS1GVEDWG1TC8RH4VL11KR6LV12FC1TS2RG1IV3LF5VLFY1SI1SNSG1FS7VE6YI3LM8FH2
mikado.all.Calendula_officinalis_EIV1.2_Contig_0000001G1002.1 A0A251U8X2 393 383 78.7 375 80 0 19 393 9 383 3.96e-216 603 86.4 2SQHKSTFS1TS1DN2QESYVA1TP2SQ8MR16RY1IL3SP2IV1ST2NT3SC1SK2TA16CSHS1HYDH3DNNH1IVND32NT8KRKN1NSQR2NY2FR1GAKL2WC8DQ2GT3KR2GE3LV2ND1SA2ES1IS1KR5MY6GDGA2QH20RK8VM6EANK2FY2VI8VD6VA3WS1TC7NV10GR25LFFC1TS2RG1IL3LF4VI1FY6FS5FV1VE4YC1YF10IN1FP2

classify_transcripts --evalue_filter 1.0E-6 --min_pct_cds_fraction 0.5 \
  --max_tp_utr_complete 1 --max_tp_utr 2 --min_tp_utr 1 \
  --max_fp_utr_complete 2 --max_fp_utr 3 --min_fp_utr 1 \
  -b head_test_diamond.hits.tsv \
  --query_start_hard_filter_distance 10 --query_start_score 5 --query_start_scoring_distance 30 \
  --query_end_hard_filter_distance 10 --query_end_score 5 --query_end_scoring_distance 30 \
  -t test_all_models.clustered.gff \
  --target_start_hard_filter_distance 10 --target_start_score 5 --target_start_scoring_distance 30 \
  --target_end_hard_filter_distance 10 --target_end_score 5 --target_end_scoring_distance 30 \
  --min_query_coverage_hard_filter 90 --min_query_coverage_score 5 --min_query_coverage_scoring_percentage 30 \
  --min_target_coverage_hard_filter 90 --min_target_coverage_score 5 --min_target_coverage_scoring_percentage 30 \
  --max_single_gap_hard_filter 20 --max_single_gap_score 5 --max_single_gap_scoring_length 30; myhist

Attachments: test_all_models.clustered.gff.txt, head_test_diamond.hits.tsv.txt, head_test_ori_diamond.hits.tsv.txt

swarbred commented 2 years ago

Looking at a different run on the protist, there we have a spread of gold, silver and bronze classified transcripts.

So not the same negative result as the marigold run above (where all were silver).

The only clear difference I see is that in marigold the set of input --homology_proteins lacked stops, whereas in the case of the protist the stops were present.

So for marigold, in the earlier run (pre-seqkit) the REAT-created protein file MATCHED the input file (i.e. stop codon included) and this resulted in mixed classified transcripts (gold, silver and bronze), whereas when the REAT-created protein file (post-seqkit change) DID NOT MATCH the input proteins (i.e. the REAT protein file contained the stop and the input lacked it), we got only silver classified transcripts.

So classify seems very sensitive to something (i.e. including the final stop codon in the protein file) that I wouldn't expect to make such a dramatic difference.

ljyanesm commented 2 years ago

Currently the Gold category is reserved for 100% perfect matches; even a single aa off will be classified as partial. A 'wiggle' parameter or some other check could be added, but as defined, the scenario above would result in no Gold models.
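For illustration only, one possible shape for such a check; the `wiggle` parameter and `is_end_to_end` helper below are hypothetical and not part of classify_transcripts:

```python
# Hypothetical sketch of a 'wiggle' tolerance on an end-to-end match;
# none of these names exist in classify_transcripts.
def is_end_to_end(qstart, qend, qlen, tstart, tend, tlen, wiggle=0):
    return (qstart - 1 <= wiggle and qlen - qend <= wiggle
            and tstart - 1 <= wiggle and tlen - tend <= wiggle)

# With wiggle=0 the hit above (qlen=394, qend=393) fails the check;
# with wiggle=1 (or a default of 3, as suggested below) it would pass.
print(is_end_to_end(1, 393, 394, 1, 393, 393, wiggle=0))  # False
print(is_end_to_end(1, 393, 394, 1, 393, 393, wiggle=1))  # True
```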

swarbred commented 2 years ago

We should have a user-available wiggle parameter with a default of, say, 3; otherwise cases like this, where the db and the query set don't match regarding whether the stop is included, will have a major effect. Thanks again.

swarbred commented 2 years ago

@gemygk will make the change to remove any terminal stops from the input (db) protein sequences and also from the query proteins post-seqkit, i.e. change

https://github.com/EI-CoreBioinformatics/reat/blob/8ffc62b0e3caa52c751f6fdeba9f78d01ff20127/annotation/prediction_module/main.wdl#L1180

and

https://github.com/EI-CoreBioinformatics/reat/blob/8ffc62b0e3caa52c751f6fdeba9f78d01ff20127/annotation/prediction_module/main.wdl#L1176

That way, for the diamond alignment, both the db and the query proteins lack the stop and are consistent, and we fix this for the marigold run (to get this through I just stopped the pipeline, corrected the file and restarted so it used the corrected, stop-stripped query file).
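For reference, one way the stop-stripping could look, as a standalone sketch rather than the actual WDL change linked above; the file names are placeholders:

```python
# Sketch only: strip trailing '*' stop symbols from each record of a
# protein FASTA so the db and query proteins are consistent before diamond.
# File names are placeholders, not the pipeline's real paths.
def strip_terminal_stops(in_fasta, out_fasta):
    with open(in_fasta) as fin, open(out_fasta, "w") as fout:
        seq = []
        def flush():
            if seq:
                fout.write("".join(seq).rstrip("*") + "\n")
                seq.clear()
        for line in fin:
            line = line.rstrip("\n")
            if line.startswith(">"):
                flush()
                fout.write(line + "\n")
            else:
                seq.append(line)
        flush()

strip_terminal_stops("proteins.faa", "proteins.nostop.faa")
```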

The wiggle parameter for classify_transcripts would still be desirable, but can be added when convenient.

ljyanesm commented 2 years ago

@swarbred going back to the wiggle parameter: there are four different cases where it could apply (start/end of query/target), and it could apply in one or more of these cases. What should the correct behaviour be here? Accept wiggle in every case? Only at the end of the query? Only at the end of the target?

swarbred commented 2 years ago

@ljyanesm thanks

So we have

--query_start_hard_filter_distance QUERY_START_HARD_FILTER_DISTANCE
    If query hit starts after this value, the transcript cannot belong to the Gold category (default: 10)
--query_end_hard_filter_distance QUERY_END_HARD_FILTER_DISTANCE
    If query hit ends after this value, the transcript cannot belong to the Gold category (default: 10)
--target_start_hard_filter_distance TARGET_START_HARD_FILTER_DISTANCE
    If target hit starts after this value, the transcript cannot belong to the Gold category (default: 10)
--target_end_hard_filter_distance TARGET_END_HARD_FILTER_DISTANCE
    If target hit ends after this value, the transcript cannot belong to the Gold category (default: 10)

from https://github.com/EI-CoreBioinformatics/reat/blob/145559927a8d4d10daf970f8e33f864186002393/annotation/scripts/classify_transcripts#L30

My understanding, which may be wrong, is that these are used in the scoring, i.e. in this case better than "10" scores the maximum, between 10 and 30 is a scaled score, and worse than 30 is a zero score; but given the above issue, are these NOT used to define the Gold category as indicated in the option help?
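To make that reading concrete, here it is as a sketch of how I understand the help text (not the actual scoring code):

```python
# Sketch of my reading of [hard_filter, max_score, scoring_distance],
# not the actual classify_transcripts implementation.
def distance_score(distance, hard_filter=10, max_score=5, scoring_distance=30):
    if distance <= hard_filter:
        return max_score              # better than the hard filter: full score
    if distance <= scoring_distance:  # between 10 and 30: linearly scaled
        return max_score * (scoring_distance - distance) / (scoring_distance - hard_filter)
    return 0.0                        # worse than 30: no score

print(distance_score(8))   # 5.0
print(distance_score(20))  # 2.5
print(distance_score(35))  # 0.0
```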

So my query back would be: could these not define the amount of wiggle? If that makes sense, then Gold would be transcripts not exceeding the hard filter for any of these limits.

Again, my understanding may be wrong here and these can't be used in place of a new wiggle parameter; if that is the case, then I would have one option that sets the limit for the four cases and require all to be met (with @gemygk's fix on the inputs the default of this new option could be zero, i.e. no change from the current behaviour, but it would give us the option to be more permissive).

From your earlier comment it sounds like we require end-to-end alignment coverage (I assume gaps/mismatches are fine), and therefore the above four existing filters don't currently tie in with the Gold classification.

Whether we can use the existing four options or require a new wiggle option, my assumption is that this would be more permissive than what we currently have.

To confirm: is there no additional check to indicate the query has a start/stop in defining Gold? For Silver they have to have UTRs, I believe; is that also true of Gold?

ljyanesm commented 2 years ago

Dear @swarbred,

Sorry for taking so long to reply! The scores are less subtle than what is represented in the command's help (my apologies). Currently, only transcripts obtaining the best possible score are classified in the Gold category, that is, perfect scores on all the metrics and not having internal stop codons or missing the start or stop codon as defined by gffread's '-P' option.

The classification script requires some changes to adjust to your comment above, namely reordering when the score is applied so that the hard limits are used as the command's help suggests.

Using the existing hard filter options as you suggest is the best alternative to provide the 'wiggle' factor. I will make these changes as soon as possible in a branch to be reviewed and merged when we are happy with the results.

> To confirm: is there no additional check to indicate the query has a start/stop in defining Gold? For Silver they have to have UTRs, I believe; is that also true of Gold?

There are checks regarding the presence of stop codons, or missing start/stop codons, as defined by the '-P' option in gffread, as mentioned above. Models do have to have the required UTRs to be classified as Gold.
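Summarising those conditions as a sketch (the field and function names below are illustrative only, not classify_transcripts internals):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    # Illustrative fields only, not classify_transcripts internals.
    score: float               # combined score over all metrics
    has_internal_stop: bool
    has_start_codon: bool      # as flagged by gffread -P
    has_stop_codon: bool       # as flagged by gffread -P
    has_required_utrs: bool

def looks_gold(c: Candidate, perfect_score: float) -> bool:
    """Gold as described above: best possible score on every metric,
    no internal stops, start and stop present, and the required UTRs."""
    return (c.score == perfect_score
            and not c.has_internal_stop
            and c.has_start_codon
            and c.has_stop_codon
            and c.has_required_utrs)
```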

Sorry again for the slow response, I will make some time to work on this in the next couple of weeks. If you would like this to be completed sooner than that, please do let me know.

Kindly, Luis.

swarbred commented 2 years ago

@ljyanesm Thanks for the comment, and sorry for my own delayed reply.

> perfect scores on all the metrics

I don't think I'm clear about the scoring for max_distance_start_query: does within 10 bp score the max score of 5? And when you say only those with the best possible score plus the additional stop checks form the Gold category, doesn't that indicate that for max_distance_start_query those starting at, say, 8 bp from the start would have the best score of 5 and could, if all the other scoring parameters are met, fall in the Gold category?

From your earlier comment, "100% perfect matches, even a single aa off will be classified as partial", that suggests a higher bar than just achieving the top score based on the scoring parameters below. Unless I'm not thinking correctly about how the scoring is applied.

Scoring parameters

max_distance_start_query = [10, 5, 30]   # 10bp for hard filter, max_score 5, 30bp from start for score
max_distance_end_query = [10, 5, 30]     # 10bp for hard filter, max_score 5, 30bp from end for score
max_distance_start_target = [10, 5, 30]  # 10bp for hard filter, max_score 5, 30bp from start for score
max_distance_end_target = [10, 5, 30]    # 10bp for hard filter, max_score 5, 30bp from end for score
min_coverage_query = [90, 5, 70]         # 90% for hard filter, max_score 5 %cov_qry, 70% to start scoring
min_coverage_target = [90, 5, 70]        # 90% for hard filter, max_score 5 %cov_tgt, 70% to start scoring
max_single_qgap_length = [20, 5, 50]     # 20bp for hard filter, max_score 5 * %long_gap_len/qlen, 50bp to start scoring