TrinityCTAT / CTAT-LR-fusion

fusion transcript detection using long reads, leveraging ctat-minimap2 and FusionInspector

CTAT-lr outputs fusions for which "num_LR", "LeftGene", and "RightGene" are empty #4

Closed ljwoods2 closed 3 months ago

ljwoods2 commented 3 months ago

When running CTAT-LR-fusion with short-read data, some of the output columns in fusion_predictions.tsv have empty values. I believe this might result from a FusionInspector error that isn't handled correctly, since the number of rows with empty data matches the number of times FusionInspector threw the following error during the run (though this could be a coincidence):

    [939/1050 = 89.4 % done]    Error - no gene spans 100M bases in length.... likely problem at /usr/local/bin/FusionInspector/util/fusion_pair_to_mini_genome_join.pl line 669.
        main::get_gene_span_info("chr8\x{9}ENSEMBL\x{9}exon\x{9}13160178\x{9}13160279\x{9}.\x{9}+\x{9}.\x{9}gene_id \"Y_RNA^ENSG"...) called at /usr/local/bin/FusionInspector/util/fusion_pair_to_mini_genome_join.pl line 436
        main::get_gene_contig_gtf("chr8\x{9}ENSEMBL\x{9}exon\x{9}13160178\x{9}13160279\x{9}.\x{9}+\x{9}.\x{9}gene_id \"Y_RNA^ENSG"..., "/home/tgenref/homo_sapiens/grch38_hg38/hg38_tempe/gene_model/"...) called at /usr/local/bin/FusionInspector/util/fusion_pair_to_mini_genome_join.pl line 230
        eval {...} called at /usr/local/bin/FusionInspector/util/fusion_pair_to_mini_genome_join.pl line 226
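One way to test whether the two counts really match is to tally both directly. The following is a minimal sketch using only the Python standard library. It assumes the run's log output was captured to a text file and that fusion_predictions.tsv has a header row; the `FusionName` column and the gene names in the toy data are hypothetical placeholders, and only `num_LR`, `LeftGene`, and `RightGene` come from the report itself.

```python
import csv
import io

# Substring taken from the FusionInspector message quoted above.
ERROR_MARKER = "Error - no gene spans"
# Column names taken from the issue title.
KEY_COLUMNS = ("num_LR", "LeftGene", "RightGene")


def count_error_lines(log_lines):
    """Count log lines that contain the FusionInspector error marker."""
    return sum(1 for line in log_lines if ERROR_MARKER in line)


def count_rows_with_empty(tsv_handle, columns=KEY_COLUMNS):
    """Count TSV rows in which at least one of `columns` is empty."""
    reader = csv.DictReader(tsv_handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return sum(
        1
        for row in reader
        if any(not (row.get(col) or "").strip() for col in columns)
    )


# Toy inputs (made up for illustration; not real CTAT-LR-fusion output).
toy_log = [
    "some unrelated log line",
    "[939/1050 = 89.4 % done]    Error - no gene spans 100M bases in length....",
]
toy_tsv = (
    "FusionName\tnum_LR\tLeftGene\tRightGene\n"
    "GENE_A--GENE_B\t3\tGENE_A\tGENE_B\n"
    "GENE_C--GENE_D\t\t\t\n"
)

n_errors = count_error_lines(toy_log)
n_empty = count_rows_with_empty(io.StringIO(toy_tsv))
print(f"error lines: {n_errors}, rows with empty values: {n_empty}")
```

With real files, the same functions can be given open file handles instead of the toy inputs. Equal counts would only be suggestive; they would not show that the error causes the empty columns.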

I attached an Excel sheet (so GitHub will accept it) containing the affected rows from fusion_predictions.tsv (with data stripped) to show the empty values. In the TSV, each empty value simply shows up as two tabs in a row.

Here are the CTAT-LR-fusion arguments that were run (also with information stripped, sorry):


    ctat-LR-fusion \
      --CPU 10 \
      --genome_lib_dir <path> \
      -T "<path>/<file>.fastq.gz" \
      --left_fq "<path>/<file>.fastq.gz" \
      --right_fq "<path>/<file>.fastq.gz" \
      --output <path> \
      --vis

This isn't a breaking issue, since after loading the TSV into a dataframe you can simply filter out rows that have empty values in these columns, but I wanted to make you aware in case this wasn't on your radar! I can also try to provide a more detailed minimal reproducible example if that would help, perhaps using test files from your repo?

Attachment: CTAT-LR issue.xlsx (https://github.com/user-attachments/files/15958554/CTAT-LR.issue.xlsx)
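The filtering workaround described above can be sketched as follows. This version uses Python's standard `csv` module rather than a dataframe library, so it is a stand-in for the approach and not the exact code used in the report. The `FusionName` column and the toy rows are hypothetical. By default only an empty string counts as missing; the `missing` parameter is an assumption that lets a token such as `NA` be treated the same way.

```python
import csv
import io

# Column names taken from the issue title.
KEY_COLUMNS = ("num_LR", "LeftGene", "RightGene")


def split_rows(tsv_handle, columns=KEY_COLUMNS, missing=("",)):
    """Split TSV rows into (complete, incomplete) lists.

    A row is incomplete if any of `columns` holds a value listed in `missing`.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    complete, incomplete = [], []
    for row in reader:
        has_gap = any((row.get(col) or "").strip() in missing for col in columns)
        (incomplete if has_gap else complete).append(row)
    return complete, incomplete


# Toy table (made up for illustration; not real CTAT-LR-fusion output).
toy_tsv = (
    "FusionName\tnum_LR\tLeftGene\tRightGene\n"
    "GENE_A--GENE_B\t3\tGENE_A\tGENE_B\n"
    "GENE_C--GENE_D\t\t\t\n"
)

complete, incomplete = split_rows(io.StringIO(toy_tsv))
print(f"kept {len(complete)} rows, set aside {len(incomplete)} rows")
```

Returning the incomplete rows instead of discarding them keeps them available for inspection, which matters if they turn out to be valid predictions that lack only some fields.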

brianjohnhaas commented 3 months ago

Thanks. I'm not so sure that this is a bug, exactly. Some of the warning and error messages are overly verbose and in some cases irrelevant, and I should deal with that separately. If there are fusions where breakpoints are detected and supported only by the short reads, then you'll find some NA values showing up where the long-read support would otherwise be reported. Perhaps that's the main issue here?

best,

Brian

(In reply to ljwoods2's message of Mon, Jun 24, 2024, 12:23 PM.)

--

Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas
http://broad.mit.edu/~bhaas

ljwoods2 commented 3 months ago

My mistake, I think I misread the docs. These must be alternative splicing events for which only short-read evidence exists. I'll go ahead and close.