TrinityCTAT / ctat-mutations

Mutation detection using GATK4 best practices and latest RNA editing filters resources. Works with both Hg38 and Hg19
https://github.com/TrinityCTAT/ctat-mutations
Other
73 stars 18 forks source link

RNAEditing Analysis #107

Open kokyriakidis opened 2 years ago

kokyriakidis commented 2 years ago

Hi @brianjohnhaas

Q1: I want to use ctat-mutations to find RNA Editing events. I have only RNA Seq data. Does your implementation find most of the variants around all genomic regions or does it have a preference for variants in specific regions of the genome (e.g. coding). I am telling this because my data have high intron mapping for some reason.

Q2: Should I use a boosting method for my use case? If yes, which one do you propose? I saw in your code that it removes RNAEditing column when it prepares for boosting. Does that mean that the variants that are annotated with RNAEDIT will not be in the final vcf file? And this vcf file will likely not contain other RNAEditing events because they will be filtered by the model? Do you think that boosting will likely remove potential RNAEditing events from the vcf because they will be considered as FPs?

Q3: Let's say I do not use boosting. Do you thing HC hard filters will likely remove potential RNAEditing sites from the vcf because they will be considered as FPs by the filters?

Q4: Is it in your plan to update REDIportal to V2?

Thanks!

brianjohnhaas commented 2 years ago

hi,

responses below

On Mon, Jan 17, 2022 at 3:06 AM Konstantinos Kyriakidis < @.***> wrote:

Hi @brianjohnhaas https://github.com/brianjohnhaas

Q1: I want to use ctat-mutations to find RNA Editing events. I have only RNA Seq data. Does your implementation find most of the variants around all genomic regions or does it have a preference for variants in specific regions of the genome (e.g. coding). I am telling this because my data have high intron mapping for some reason.

it doesn't restrict to specific genomic regions

Q2: Should I use a boosting method for my use case? If yes, which one do you propose?

The default 'xgboost' tends to work the best.

Q3: Is it in your plan to update REDIportal to V2?

we will plan to update rediportal later this year for the next release. It requires that we redo our benchmarking and testing.

best of luck!

Thanks!

— Reply to this email directly, view it on GitHub https://github.com/NCIP/ctat-mutations/issues/107, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZRKX76JRBTFSSZTRQ46IDUWPEXNANCNFSM5MD76ALA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

--

Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas http://broad.mit.edu/~bhaas

kokyriakidis commented 2 years ago

Hi @brianjohnhaas

I get the following errors when I try to run the pipeline

[2022-01-19 05:51:12,10] [error] Failed to hash "/ctat_genome_lib_dir/ctat_mutation_lib/ref_annot.splice_adj.bed.gz": Cannot hash file /ctat_genome_lib_dir/ctat_mutation_lib/ref_annot.splice_adj.bed.gz because it can't be found

[2022-01-19 05:51:12,10] [error] d062b032:annotate_variants_wf.annotate_splice_distance:-1:1: Hash error (Cannot hash file /ctat_genome_lib_dir/ctat_mutation_lib/ref_annot.splice_adj.bed.gz because it can't be found), disabling call caching for this job.

There in no such file inside the GRCh38.mutation_lib_supplement.Jul272020. For that reason the workflow fails.

Could not localize /ctat_genome_lib_dir/ctat_mutation_lib/ref_annot.splice_adj.bed.gz -> /output/C4E/cromwell-executions/ctat_mutations/9c151408-da17-4155-afa9-0b83b9d81bbd/call-AnnotateVariants/annotate_variants_wf/d062b032-21b1-4df8-943a-03a20344967a/call-annotate_splice_distance/inputs/-1376602862/ref_annot.splice_adj.bed.gz:
    /ctat_genome_lib_dir/ctat_mutation_lib/ref_annot.splice_adj.bed.gz doesn't exist
    File not found /output/C4E/cromwell-executions/ctat_mutations/9c151408-da17-4155-afa9-0b83b9d81bbd/call-AnnotateVariants/annotate_variants_wf/d062b032-21b1-4df8-943a-03a20344967a/call-annotate_splice_distance/inputs/-1376602862/ref_annot.splice_adj.bed.gz -> /ctat_genome_lib_dir/ctat_mutation_lib/ref_annot.splice_adj.bed.gz
    File not found /ctat_genome_lib_dir/ctat_mutation_lib/ref_annot.splice_adj.bed.gz
    File not found /ctat_genome_lib_dir/ctat_mutation_lib/ref_annot.splice_adj.bed.gz
    at common.validation.Validation$ValidationTry$.toTry$extension1(Validation.scala:94)
    at common.validation.Validation$ValidationTry$.toTry$extension0(Validation.scala:90)
    at cromwell.backend.standard.StandardAsyncExecutionActor.instantiatedCommand(StandardAsyncExecutionActor.scala:668)
    ... 35 more

[2022-01-19 05:51:47,86] [info] WorkflowManagerActor WorkflowActor-9c151408-da17-4155-afa9-0b83b9d81bbd is in a terminal state: WorkflowFailedState
[2022-01-19 05:52:10,91] [info] SingleWorkflowRunnerActor workflow finished with status 'Failed'.
[2022-01-19 05:52:15,35] [info] SingleWorkflowRunnerActor writing metadata to /tmp/tmpn8b6knbi.json
[2022-01-19 05:52:15,37] [info] Workflow polling stopped
[2022-01-19 05:52:15,39] [info] Shutting down WorkflowStoreActor - Timeout = 5 seconds
[2022-01-19 05:52:15,39] [info] Shutting down WorkflowLogCopyRouter - Timeout = 5 seconds
[2022-01-19 05:52:15,39] [info] Shutting down JobExecutionTokenDispenser - Timeout = 5 seconds
[2022-01-19 05:52:15,39] [info] Aborting all running workflows.
[2022-01-19 05:52:15,39] [info] 0 workflows released by cromid-007ba5c
[2022-01-19 05:52:15,39] [info] JobExecutionTokenDispenser stopped
[2022-01-19 05:52:15,39] [info] WorkflowStoreActor stopped
[2022-01-19 05:52:15,40] [info] WorkflowLogCopyRouter stopped
[2022-01-19 05:52:15,40] [info] Shutting down WorkflowManagerActor - Timeout = 3600 seconds
[2022-01-19 05:52:15,40] [info] WorkflowManagerActor stopped
[2022-01-19 05:52:15,40] [info] WorkflowManagerActor All workflows finished
[2022-01-19 05:52:15,54] [info] Connection pools shut down
[2022-01-19 05:52:15,54] [info] Shutting down SubWorkflowStoreActor - Timeout = 1800 seconds
[2022-01-19 05:52:15,54] [info] Shutting down JobStoreActor - Timeout = 1800 seconds
[2022-01-19 05:52:15,54] [info] Shutting down CallCacheWriteActor - Timeout = 1800 seconds
[2022-01-19 05:52:15,54] [info] Shutting down ServiceRegistryActor - Timeout = 1800 seconds
[2022-01-19 05:52:15,54] [info] Shutting down DockerHashActor - Timeout = 1800 seconds
[2022-01-19 05:52:15,54] [info] Shutting down IoProxy - Timeout = 1800 seconds
[2022-01-19 05:52:15,54] [info] SubWorkflowStoreActor stopped
[2022-01-19 05:52:15,54] [info] WriteMetadataActor Shutting down: 0 queued messages to process
[2022-01-19 05:52:15,54] [info] CallCacheWriteActor Shutting down: 0 queued messages to process
[2022-01-19 05:52:15,54] [info] KvWriteActor Shutting down: 0 queued messages to process
[2022-01-19 05:52:15,54] [info] JobStoreActor stopped
[2022-01-19 05:52:15,54] [info] CallCacheWriteActor stopped
[2022-01-19 05:52:15,54] [info] IoProxy stopped
[2022-01-19 05:52:15,54] [info] ServiceRegistryActor stopped
[2022-01-19 05:52:15,55] [info] DockerHashActor stopped
[2022-01-19 05:52:15,58] [info] Database closed
[2022-01-19 05:52:15,58] [info] Stream materializer shut down
[2022-01-19 05:52:15,59] [info] Shutting down connection pool: curAllocated=0 idleQueues.size=0 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2022-01-19 05:52:15,59] [info] Shutting down connection pool: curAllocated=0 idleQueues.size=0 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2022-01-19 05:52:15,59] [info] Shutting down connection pool: curAllocated=0 idleQueues.size=0 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2022-01-19 05:52:15,59] [info] Shutting down connection pool: curAllocated=0 idleQueues.size=0 waitQueue.size=0 maxWaitQueueLimit=256 closed=false
[2022-01-19 05:52:15,60] [info] WDL HTTP import resolver closed
Workflow 9c151408-da17-4155-afa9-0b83b9d81bbd transitioned to state Failed
kokyriakidis commented 2 years ago

My bad! The integrations step did not run properly the first time!

kokyriakidis commented 2 years ago

Annotate BLAT ED takes several hours to complete (more than 5) and uses only one cpu core. I see that pblat is multithreaded. Shouldn't it use all specified cores?

brianjohnhaas commented 2 years ago

hi,

Can you see the pblat command that's running? ie. psauxww | grep pblat

and does it have threads set to 1 ?

The workflow is supposed to propagate the CPU parameter setting to the command here. I'm curious if it's not happening.

best,

~b

On Thu, Jan 20, 2022 at 5:09 AM Konstantinos Kyriakidis < @.***> wrote:

Reopened #107 https://github.com/NCIP/ctat-mutations/issues/107.

— Reply to this email directly, view it on GitHub https://github.com/NCIP/ctat-mutations/issues/107#event-5923623594, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZRKX2UXENOOWSGEIYUJDDUW7NPJANCNFSM5MD76ALA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

--

Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas http://broad.mit.edu/~bhaas

kokyriakidis commented 2 years ago

Hmm... It starts to run using all cores and after an 1-2 hours it continues with only one core. Then, it takes several hours to finish.

brianjohnhaas commented 2 years ago

ok, I'm wondering if it's the pblat step that's to blame here, or if it's our code that analyzes the results of pblat. I'll have to look into it some more. Most of that step should be using multiple threads. It is time consuming though - one of the slowest steps in the whole pipeline.

On Thu, Jan 20, 2022 at 11:44 AM Konstantinos Kyriakidis < @.***> wrote:

Hmm... It starts to run using all cores and after an 1-2 hours it continues with only one core. Then, it takes several hours to finish.

— Reply to this email directly, view it on GitHub https://github.com/NCIP/ctat-mutations/issues/107#issuecomment-1017702082, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZRKX4JEV5VCK5K6ASRQ7LUXA3XZANCNFSM5MD76ALA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

--

Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas http://broad.mit.edu/~bhaas

kokyriakidis commented 2 years ago

It is probably your code:

+ echo '########### Annotate BLAT ED #############'
+ /usr/local/src/ctat-mutations/src/annotate_ED.py --input_vcf /output/AD3E/cromwell-executions/ctat_mutations/6162ec39-8de3-4584-889f-cac0f9a5d904/call-AnnotateVariants/annotate_variants_wf/d4df6ce4-e64b-43ab-a1d3-8f2ff9c302cd/call-annotate_blat_ED/inputs/1608256378/AD3E.splice_distance.vcf.gz --output_vcf AD3E.blat_ED.vcf --reference /output/AD3E/cromwell-executions/ctat_mutations/6162ec39-8de3-4584-889f-cac0f9a5d904/call-AnnotateVariants/annotate_variants_wf/d4df6ce4-e64b-43ab-a1d3-8f2ff9c302cd/call-annotate_blat_ED/inputs/817306935/ref_genome.fa --temp_dir /output/AD3E/cromwell-executions/ctat_mutations/6162ec39-8de3-4584-889f-cac0f9a5d904/call-AnnotateVariants/annotate_variants_wf/d4df6ce4-e64b-43ab-a1d3-8f2ff9c302cd/call-annotate_blat_ED/tmp.a535223e --threads 15
15:24:57 : INFO : 
################################
 Annotating VCF: Calculating ED 
################################

15:24:57 : INFO : Processing VCF Positions
15:25:03 : INFO : Running samtools faidx
15:25:07 : INFO : Running Blat
15:59:34 : INFO : Processing Output
16:00:25 : INFO : Creating ED features

It started 15:24 and it is now 19:22 and it is still running using 1 core.

brianjohnhaas commented 2 years ago

We had done some testing on this step to ensure it was using multiple threads and once upon a time it did scale. Leave this open and we'll revisit it before the next release.

On Thu, Jan 20, 2022 at 12:22 PM Konstantinos Kyriakidis < @.***> wrote:

It is probably your code:

  • echo '########### Annotate BLAT ED #############'
  • /usr/local/src/ctat-mutations/src/annotate_ED.py --input_vcf /output/AD3E/cromwell-executions/ctat_mutations/6162ec39-8de3-4584-889f-cac0f9a5d904/call-AnnotateVariants/annotate_variants_wf/d4df6ce4-e64b-43ab-a1d3-8f2ff9c302cd/call-annotate_blat_ED/inputs/1608256378/AD3E.splice_distance.vcf.gz --output_vcf AD3E.blat_ED.vcf --reference /output/AD3E/cromwell-executions/ctat_mutations/6162ec39-8de3-4584-889f-cac0f9a5d904/call-AnnotateVariants/annotate_variants_wf/d4df6ce4-e64b-43ab-a1d3-8f2ff9c302cd/call-annotate_blat_ED/inputs/817306935/ref_genome.fa --temp_dir /output/AD3E/cromwell-executions/ctat_mutations/6162ec39-8de3-4584-889f-cac0f9a5d904/call-AnnotateVariants/annotate_variants_wf/d4df6ce4-e64b-43ab-a1d3-8f2ff9c302cd/call-annotate_blat_ED/tmp.a535223e --threads 15 15:24:57 : INFO : ################################ Annotating VCF: Calculating ED ################################

15:24:57 : INFO : Processing VCF Positions 15:25:03 : INFO : Running samtools faidx 15:25:07 : INFO : Running Blat 15:59:34 : INFO : Processing Output 16:00:25 : INFO : Creating ED features

It started 15:24 and it is now 19:22 and it is still running using 1 core.

— Reply to this email directly, view it on GitHub https://github.com/NCIP/ctat-mutations/issues/107#issuecomment-1017740311, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZRKX45WE2YWIHOOU3EQFDUXBAHFANCNFSM5MD76ALA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

--

Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas http://broad.mit.edu/~bhaas

kokyriakidis commented 2 years ago

Just for the record, it did 12h to complete. From Annotate BLAT ED to finish.

brianjohnhaas commented 2 years ago

in the meantime, if you don't need boosting (can set boosting method = 'none'), then you don't necessarily require the ED annotation. You can turn off the annotations that you don't in the main wdl under WDL/ ctat_mutations.wdl

annotation options

    Boolean incl_snpEff = true

    Boolean incl_dbsnp = true

    Boolean incl_gnomad = true

    Boolean incl_rna_editing = true

    Boolean include_read_var_pos_annotations = true

    Boolean incl_repeats = true

    Boolean incl_homopolymers = true

    Boolean incl_splice_dist = true

    Boolean incl_blat_ED = true

    Boolean incl_cosmic = true

    Boolean incl_cravat = true

On Fri, Jan 21, 2022 at 2:41 AM Konstantinos Kyriakidis < @.***> wrote:

Just for the record, it did 12h to complete. From Annotate BLAT ED to finish.

— Reply to this email directly, view it on GitHub https://github.com/NCIP/ctat-mutations/issues/107#issuecomment-1018257750, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZRKXYFRC7WBQ3SICLHUSLUXEEYZANCNFSM5MD76ALA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

--

Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas http://broad.mit.edu/~bhaas

kokyriakidis commented 2 years ago

I was looking for something like this! Thanks so much for bringing it up!

brianjohnhaas commented 2 years ago

sure thing!

On Fri, Jan 21, 2022 at 9:37 AM Konstantinos Kyriakidis < @.***> wrote:

I was looking for something like this! Thanks so much for bringing it up!

— Reply to this email directly, view it on GitHub https://github.com/NCIP/ctat-mutations/issues/107#issuecomment-1018559772, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZRKX4U5OTIOOCNTZDRDSTUXFVRFANCNFSM5MD76ALA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

--

Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas http://broad.mit.edu/~bhaas

kokyriakidis commented 2 years ago

Hi @brianjohnhaas !

What does ED=-1 means?

brianjohnhaas commented 2 years ago

Pretty sure this means it's overly repetitive - it's the default in case blat doesn't return hits, which would mean that it has more hits than are within the range of reporting. @Maxwell Brown @.***> , can you confirm?

Are you seeing a ton of these?

On Tue, Jan 25, 2022 at 2:43 PM Konstantinos Kyriakidis < @.***> wrote:

Hi @brianjohnhaas https://github.com/brianjohnhaas !

What does ED=-1 means?

— Reply to this email directly, view it on GitHub https://github.com/NCIP/ctat-mutations/issues/107#issuecomment-1021545105, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZRKX6UVLRR2QYT26EWKQDUX34N5ANCNFSM5MD76ALA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

--

Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas http://broad.mit.edu/~bhaas

kokyriakidis commented 2 years ago

I see a lot of them but not a ton.

WHat about entropy? Can you please explain what it means and how can I evaluate it?

brianjohnhaas commented 2 years ago

entropy is just a measure of sequence complexity. If it's close to 2.0, then it's highly complex. If it's closer to 1 or below, then it's low complexity (like a polyA run).

We use all these annotations for boosting prediction accuracy with the downstream machine learning steps (ie. XGBoost).

On Thu, Jan 27, 2022 at 12:22 PM Konstantinos Kyriakidis < @.***> wrote:

I see a lot of them but not a ton.

WHat about entropy? Can you please explain what it means and how can I evaluate it?

— Reply to this email directly, view it on GitHub https://github.com/NCIP/ctat-mutations/issues/107#issuecomment-1023463695, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZRKX6EZUJJAXBH2FUL63DUYF5O5ANCNFSM5MD76ALA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

--

Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas http://broad.mit.edu/~bhaas