Closed swarbred closed 2 years ago
Looking at a different run on the protist and there we have a spread of gold silver and bronze classified transcripts.
so not the same negative results as the marigold run above (where all were silver)
The only clear difference i see is that in marigold the set of input --homology_proteins
lacked stops where in the case of the protists the stops were present.
so for marigold the earlier run (pre seqkit) the reat created protein file MATCHED the input file i.e. stop codon include and this resulted in mixed classified transcripts (gold, silver and bronze) where as when the reat created proteins file (post seqkit change) DID NOT MATCH the input proteins (i.e. the reat protein file contained the stop and the input lacked them) then we got only silver classified transcripts.
So classify seems very sensitive to something (i.e. including the final stop codon in the protein file) that I wouldn't expect to have such a dramtic difference.
Currently gold category is reserved to 100% perfect matches, even a single aa off will be classified as partial, there could be a 'wiggle' parameter added or some other check, but as defined, the scenario above would result in no gold models
We should have a user available wiggle parameter and set a default of say 3, otherwise cases like this where the db and the query set dont match regarding if the stop is included will have a major affect. Thanks again
@gemygk will make the change to remove any terminal stops from the input (db) proteins sequence and also from the query proteins post seqkit i.e. change
and
That way for the diamond alignment both the db and the query proteins lack the stop and are consistent, and we fix this for the marigold run (to get this through I just stopped the pipeline corrected the file and restarted so it used the corrected (stripped stop query file)
the wiggle parameter for classify_transcripts
would still be desirable but can be added when convenient.
@swarbred going back to the wiggle parameter, there are 4 different cases where it could apply start/end of query/target and it could apply in one or more of these cases, what should be the correct behaviour here? Accept wiggle on every case? Only at the end of the query? Only at the end of the target?
@ljyanesm thanks
So we have
--query_start_hard_filter_distance QUERY_START_HARD_FILTER_DISTANCE If query hit starts after this value, the transcript cannot belong to the Gold category (default: 10)
--query_end_hard_filter_distance QUERY_END_HARD_FILTER_DISTANCE If query hit ends after this value, the transcript cannot belong to the Gold category (default: 10)
--target_start_hard_filter_distance TARGET_START_HARD_FILTER_DISTANCE If target hit starts after this value, the transcript cannot belong to the Gold category (default: 10)
--target_end_hard_filter_distance TARGET_END_HARD_FILTER_DISTANCE If target hit ends after this value, the transcript cannot belong to the Gold category (default: 10)
my understanding which may be wrong is that these are used in the scoring i.e. in this case better than "10" scores the max and between 10-30 is a scaled score and worse than 30 a zero score, but given the above issue are these NOT used to define the Gold category as indicated in the option help?
So my query back would be could these not define the amount of wiggle? and if that makes sense then the Gold would be transcripts not exceeding the hard filter for any of these limits.
Again my understanding may be wrong here and these cant be used in place of a new wiggle parameter if that is the case then I would have one option that sets the limit for the 4 cases and require all to be met (with gemys fix on the inputs the default of this new option could be zero i.e. no change from currently, but would give us the option to be more permissive).
From your earlier comment it sounds like we require an end to end alignment coverage (i assume gaps mismatches are fine) therefore the above four existing filters don't tie in with the gold classification currently.
Whether we can use the existing 4 options or require a new wiggle option my assumption is that this would be more permissive over what we currently have.
To confirm there is no additional check to indicate the query has a start / stop in defining gold for silver they have to have UTRs I believe, is that also true of gold?
Dear @swarbred,
Sorry for taking so long to reply! The scores are less subtle than what is represented in the command's help (my apologies). Currently, only transcripts obtaining the best possible score are classified as Gold category, that is, perfect scores on all the metrics and not having internal stop codons or missing the start or stop codon as defined by gffread's '-P' option.
The classification script requires some changes to adjust to your above comment, namely reorder when the score is applied to use the hard limits as the command's help suggests.
Using the existing hard filter options as you suggest is the best alternative to provide the 'wiggle' factor. I will make these changes as soon as possible in a branch to be reviewed and merged when we are happy with the results.
To confirm there is no additional check to indicate the query has a start / stop in defining gold for silver they have to have UTRs I believe, is that also true of gold?
There are checks regarding presence of stop codons, or missing start/stop codons as defined by the '-P' option in gffread, as mentioned above. Models do have to have the required UTRs to be classified as Gold.
Sorry again for the slow response, I will make some time to work on this in the next couple of weeks. If you would like this to be completed sooner than that, please do let me know.
Kindly, Luis.
@ljyanesm Thanks for the comment and sorry for my own delayed reply
perfect scores on all the metrics
I dont think i'm clear about the scoring for max_distance_start_query does within 10bp score the max score of 5? and when you say only those with the best possible score + the additional stop checks form the gold category then doen't that indicate that for max_distance_start_query those starting at say 8bp from the start would have the best score of 5 and could if all the other scoring parameters are met fall in the gold category?
From your earlier comment "100% perfect matches, even a single aa off will be classified as partial," that suggest a higher bar than just achieving the top score based on the scoring parameters below. Unless i'm not thinking correct about how the scoring is applied.
max_distance_start_query = [10, 5, 30] # 10bp for hard filter, max_score 5, 30bp from start for score max_distance_end_query = [10, 5, 30] # 10bp for hard filter, max_score 5, 30bp from end for score max_distance_start_target = [10, 5, 30] # 10bp for hard filter, max_score 5, 30bp from start for score max_distance_end_target = [10, 5, 30] # 10bp for hard filter, max_score 5, 30bp from end for score min_coverage_query = [90, 5, 70] # 90% for hard filter, max_score 5 %cov_qry, 70% to start scoring min_coverage_target = [90, 5, 70] # 90% for hard filter, max_score 5 %cov_tgt, 70% to start scoring max_single_qgap_length = [20, 5, 50] # 20bp for hard filter, max_score 5 * %long_gap_len/qlen, 50bp to start scoring
seqkit translate outputs the stop codon if present, this is different to gffread which removes the final stop.
This has a knock on affect when using
classify_transcripts
with all models classified as silver (no gold or bronze)The only difference between the two diamond files is the 1 amino acid longer query protein in the post seqkit diamond file (first example (394 aa) compared to the second example, with the protein from gffread (393 aa)).
That additional amino acid on the query length causes all query modes to be classified as silver.
cat head_test_diamond.hits.tsv
mikado.all.Calendula_officinalis_EIV1.2_Contig_0000001G1002.1 A0A103XHT5 394 393 80.4 393 77 0 1 393 1 393 1.30e-233 647 87.3 6EDDXEXHXHXQXQXLXLT2SF1QM3FS3DE5PS3SP2IV5MR14SL3IL3SP7NS3SN1SK2TA10KR6HP1HY5NK1IV12RL20NT8KRKR2QR2NY4GVKRPS13GT10LV3RGSA2ED1IK7MS7GD2QH29VT6EANK2FY11VD3TA2VS1GVEDWG1TC8RH4VL11KR6LV12FC1TS2RG1IV3LF5VLFY1SI1SNSG1FS7VE6YI3LM8FH2 mikado.all.Calendula_officinalis_EIV1.2_Contig_0000001G1002.1 A0A251U8X2 394 383 78.7 375 80 0 19 393 9 383 4.12e-216 603 86.4 2SQHKSTFS1TS1DN2QESYVA1TP2SQ8MR16RY1IL3SP2IV1ST2NT3SC1SK2TA16CSHS1HYDH3DNNH1IVND32NT8KRKN1NSQR2NY2FR1GAKL2WC8DQ2GT3KR2GE3LV2ND1SA2ES1IS1KR5MY6GDGA2QH20RK8VM6EANK2FY2VI8VD6VA3WS1TC7NV10GR25LFFC1TS2RG1IL3LF4VI1FY6FS5FV1VE4YC1YF10IN1FP2
(/ei/software/testing/reat/dev_prediction_module/x86_64) [swarbred@EI-HPC interactive Test]$ cat head_test_ori_diamond.hits.tsv
mikado.all.Calendula_officinalis_EIV1.2_Contig_0000001G1002.1 A0A103XHT5 393 393 80.4 393 77 0 1 393 1 393 1.25e-233 647 87.3 6EDDXEXHXHXQXQXLXLT2SF1QM3FS3DE5PS3SP2IV5MR14SL3IL3SP7NS3SN1SK2TA10KR6HP1HY5NK1IV12RL20NT8KRKR2QR2NY4GVKRPS13GT10LV3RGSA2ED1IK7MS7GD2QH29VT6EANK2FY11VD3TA2VS1GVEDWG1TC8RH4VL11KR6LV12FC1TS2RG1IV3LF5VLFY1SI1SNSG1FS7VE6YI3LM8FH2 mikado.all.Calendula_officinalis_EIV1.2_Contig_0000001G1002.1 A0A251U8X2 393 383 78.7 375 80 0 19 393 9 383 3.96e-216 603 86.4 2SQHKSTFS1TS1DN2QESYVA1TP2SQ8MR16RY1IL3SP2IV1ST2NT3SC1SK2TA16CSHS1HYDH3DNNH1IVND32NT8KRKN1NSQR2NY2FR1GAKL2WC8DQ2GT3KR2GE3LV2ND1SA2ES1IS1KR5MY6GDGA2QH20RK8VM6EANK2FY2VI8VD6VA3WS1TC7NV10GR25LFFC1TS2RG1IL3LF4VI1FY6FS5FV1VE4YC1YF10IN1FP2
classify_transcripts --evalue_filter 1.0E-6 --min_pct_cds_fraction 0.5 --max_tp_utr_complete 1 --max_tp_utr 2 --min_tp_utr 1 --max_fp_utr_complete 2 --max_fp_utr 3 --min_fp_utr 1 -b head_test_diamond.hits.tsv --query_start_hard_filter_distance 10 --query_start_score 5 --query_start_scoring_distance 30 --query_end_hard_filter_distance 10 --query_end_score 5 --query_end_scoring_distance 30 -t test_all_models.clustered.gff --target_start_hard_filter_distance 10 --target_start_score 5 --target_start_scoring_distance 30 --target_end_hard_filter_distance 10 --target_end_score 5 --target_end_scoring_distance 30 --min_query_coverage_hard_filter 90 --min_query_coverage_score 5 --min_query_coverage_scoring_percentage 30 --min_target_coverage_hard_filter 90 --min_target_coverage_score 5 --min_target_coverage_scoring_percentage 30 --max_single_gap_hard_filter 20 --max_single_gap_score 5 --max_single_gap_scoring_length 30; myhist
test_all_models.clustered.gff.txt head_test_diamond.hits.tsv.txt head_test_ori_diamond.hits.tsv.txt