STAR-Fusion / STAR-Fusion

STAR-Fusion codebase
BSD 3-Clause "New" or "Revised" License

problem with CTAT genome build #289

Open anoronh4 opened 2 years ago

anoronh4 commented 2 years ago

i am seeing the error:

Error, cmd: bash -c "set -euxo pipefail; cat /refbuild/GRCh37/starfusion/../annotation/Homo_sapiens.GRCh37.75.gtf | egrep 'IG_V_gene|IG_C_gene|IG_D_gene|IG_J_gene' | awk '{ if (\$3 == \"exon\") { print } }' | egrep ^chr14  > __gencode_refinement_chkpts/IGH_locus.gtf" died with ret 256 No such file or directory at /usr/local/src/STAR-Fusion/ctat-genome-lib-builder/util/../lib/Pipeliner.pm line 186.
    Pipeliner::run(Pipeliner=HASH(0x55e9885d75a8)) called at /usr/local/src/STAR-Fusion/ctat-genome-lib-builder/util/revise_gencode_annotations.pl line 115
Error, cmd: /usr/local/src/STAR-Fusion/ctat-genome-lib-builder/util/revise_gencode_annotations.pl --gencode_gtf /refbuild/GRCh37/starfusion/../annotation/Homo_sapiens.GRCh37.75.gtf --out_gtf /refbuild/GRCh37/starfusion/../annotation/Homo_sapiens.GRCh37.75.gtf.revised.gtf died with ret 512 No such file or directory at /usr/local/src/STAR-Fusion/ctat-genome-lib-builder/lib/Pipeliner.pm line 186.
    Pipeliner::run(Pipeliner=HASH(0x560a404fd2a8)) called at /usr/local/src/STAR-Fusion/ctat-genome-lib-builder/prep_genome_lib.pl line 460

the singularity cmd i am running with is certainly able to access the gtf file:

singularity exec \
-B /refbuild \
/toolsdir/starfusion_1.10.1.sif \
/usr/local/src/STAR-Fusion/ctat-genome-lib-builder/prep_genome_lib.pl \
--genome_fa /refbuild/GRCh37/genome/Homo_sapiens.GRCh37.dna.primary_assembly.fa \
--gtf /refbuild/GRCh37/annotation/Homo_sapiens.GRCh37.75.gtf \
--pfam_db current \
--dfam_db human \
--fusion_annot_lib /refbuild/other/starfusion_resources/GRCh37_gencode_v19_CTAT_lib_Mar012021.source.tar.gz \
--human_gencode_filter \
--annot_filter_rule /refbuild/other/starfusion_resources/AnnotFilterRule.pm

The output file __gencode_refinement_chkpts/IGH_locus.gtf is created but is empty. Could the cause be that egrep ^chr14 returns no matches? The chromosomes in my GTF are named 14 rather than chr14. Can that be easily fixed?
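A quick way to test the hypothesis above is to look at the sequence names in the GTF and replay the same filter in memory. The sketch below is only an illustration: the function names and the example record are hypothetical, and the decoding of "ret 256" assumes Pipeliner reports the raw status returned by Perl's system(), in which case 256 corresponds to exit code 1. That is the status grep returns when it selects no lines, and under set -o pipefail it is enough to fail the whole pipeline.

```python
IG_BIOTYPES = ("IG_V_gene", "IG_C_gene", "IG_D_gene", "IG_J_gene")


def seqnames(gtf_lines):
    """Return the set of sequence names (GTF column 1), skipping comments and blanks."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def igh_exon_lines(gtf_lines, chrom_prefix="chr14"):
    """Mimic the shell filter: IG biotype anywhere in the line, exon feature, line prefix."""
    kept = []
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        if not any(biotype in line for biotype in IG_BIOTYPES):
            continue
        if line.startswith(chrom_prefix):
            kept.append(line)
    return kept


def exit_code(wait_status):
    """Convert a raw wait status (as returned by Perl's system()) to an exit code."""
    return wait_status >> 8


# Hypothetical Ensembl-style record: the chromosome is "14", not "chr14".
example = [
    '14\thavana\texon\t106053226\t106053274\t.\t-\t.\t'
    'gene_id "X"; gene_biotype "IG_C_gene";\n',
]

print(seqnames(example))
print(len(igh_exon_lines(example, "chr14")))  # filter as written in the failing command
print(len(igh_exon_lines(example, "14")))     # same filter with the Ensembl-style name
print(exit_code(256))
```

If seqnames() on the real file returns names without the chr prefix, the empty IGH_locus.gtf is expected, since the filter in the failing command only keeps lines starting with chr14.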

brianjohnhaas commented 2 years ago

Hi,

The system was built around gencode data. You can probably get it to work with your inputs, but you might need to remove the --human_gencode_filter parameter. The --human_gencode_filter helps deal with a handful of otherwise difficult to capture fusions with human data.

anoronh4 commented 2 years ago

Gotcha. I have gotten a lot farther with the build, but the log contains many warnings like the following:

Use of uninitialized value $complex_annot in string ne at /usr/local/src/STAR-Fusion/ctat-genome-lib-builder/util/build_fusion_annot_db_index.pl line 115, <$fh> line 51673332.
Use of uninitialized value $simple_annot in string ne at /usr/local/src/STAR-Fusion/ctat-genome-lib-builder/util/build_fusion_annot_db_index.pl line 112, <$fh> line 51673333.

it doesn't seem to be exiting but seems problematic anyways, and the builder has been generating this type of message in quick succession going on several hours. have you seen this before?

brianjohnhaas commented 2 years ago

If it's a very small fraction of the total number of lines in the fusion annot lib, then I wouldn't worry about it.

anoronh4 commented 2 years ago

Honestly I didn't print this to a log and rerunning this may take a few days. however it seemed like thousands and thousands of lines, and it doesn't make sense all lines had three fields. fusion_lib.Mar2021.dat.gz has only 376 rows where the second column is empty and 0 entries where the 3rd column is empty.

i'll update when i have a better idea of the number of these warnings that occurred. just fyi i am using the trinityctat/starfusion:1.10.1 dockerhub image.
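To put numbers like the 376 rows quoted above in context, one could count empty columns against the total number of lines. This is a minimal sketch, assuming the file is tab-delimited with three columns; that layout, the function names, and the demo rows are assumptions for illustration, not something the builder's documentation confirms here.

```python
import gzip
from collections import Counter


def empty_column_counts(lines, expected_cols=3, sep="\t"):
    """Return (total_lines, Counter mapping column index -> lines where it is empty).

    Lines with fewer than expected_cols fields are padded, so a missing
    trailing column is counted as empty.
    """
    total = 0
    empty = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        fields += [""] * (expected_cols - len(fields))
        total += 1
        for index, value in enumerate(fields[:expected_cols]):
            if not value.strip():
                empty[index] += 1
    return total, empty


def summarize_gz(path, expected_cols=3, sep="\t"):
    """Run the same count over a gzip-compressed text file."""
    with gzip.open(path, "rt") as handle:
        return empty_column_counts(handle, expected_cols, sep)


# Made-up rows: one complete, one with an empty second column,
# one with the third column missing entirely.
demo_rows = [
    "GENEA--GENEB\tsimple\tcomplex\n",
    "GENEC--GENED\t\tcomplex\n",
    "GENEE--GENEF\tsimple\n",
]
total, empty = empty_column_counts(demo_rows)
print(total, dict(empty))
```

Dividing each count by the total gives the fraction of affected lines, which is the quantity the earlier reply suggests looking at.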

brianjohnhaas commented 2 years ago

If there ends up being a problem with that step, we can always revisit it and rebuild later (without having to redo the entire ctat genome build).

Once it's done, I'd suggest running the sample data through it (that ships with star-fusion) and see that the annotations show up.

best,

~b

abhijitswain09 commented 3 months ago

./prep_genome_lib.pl --genome_fa /media/genomiki-sourabh/GSPL-011/hg38/hg38.fa --gtf /media/genomiki-sourabh/GSPL-011/hg38/gencode.v38.chr_patch_hapl_scaff.annotation.gtf --dfam_db human --pfam_db current --CPU 16 --human_gencode_filter --output_dir /media/genomiki-sourabh/GSPL-011/hg38/

-found STAR at /usr/bin/STAR

-found makeblastdb at /usr/bin/makeblastdb

-found blastn at /usr/bin/blastn

-found hmmscan at /usr/bin/hmmscan

homo_sapiens_dfam.hmm 100%[=====================================================================================================================>] 272.23M 1.42MB/s in 86m 8s

2024-05-29 13:44:46 (53.9 KB/s) - ‘homo_sapiens_dfam.hmm’ saved [285449672/285449672]

homo_sapiens_dfam.hmm.h3f 100%[=====================================================================================================================>] 61.66M 28.0KB/s in 28m 0s

2024-05-29 14:12:54 (37.6 KB/s) - ‘homo_sapiens_dfam.hmm.h3f’ saved [64651750/64651750]

homo_sapiens_dfam.hmm.h3i 100%[=====================================================================================================================>] 89.53K 47.5KB/s in 1.9s

2024-05-29 14:12:58 (47.5 KB/s) - ‘homo_sapiens_dfam.hmm.h3i’ saved [91674/91674]

homo_sapiens_dfam.hmm.h3m 100%[=====================================================================================================================>] 101.18M 43.3KB/s in 30m 10s

2024-05-29 14:43:11 (57.2 KB/s) - ‘homo_sapiens_dfam.hmm.h3m’ saved [106096923/106096923]

homo_sapiens_dfam.hmm.h3p 100%[=====================================================================================================================>] 245.15M 1.78MB/s in 50m 56s

2024-05-29 15:34:09 (82.2 KB/s) - ‘homo_sapiens_dfam.hmm.h3p’ saved [257063264/257063264]

Pfam-A.hmm.gz 100%[=====================================================================================================================>] 285.91M 364KB/s in 13m 24s

2024-05-29 15:47:35 (364 KB/s) - ‘Pfam-A.hmm.gz’ saved [299797995]

Building a new DB, current time: 05/29/2024 15:48:42
New DB name: /media/genomiki-sourabh/GSPL-011/hg38/ref_genome.fa
New DB title: /media/genomiki-sourabh/GSPL-011/hg38//ref_genome.fa
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 455 sequences in 27.8677 seconds.

abhijitswain09 commented 3 months ago

Error, cannot locate required /media/genomiki-sourabh/GSPL-011/hg38/AnnotFilterRule.pm ... be sure to use a more modern version of the companion CTAT_GENOME_LIB at /media/genomiki-sourabh/GSPL-011/RB_RNA_seq_23.04.24/Dr_Shroff_Data/Raw_Data/input/STAR-Fusion/STAR-Fusion line 519.

This is what I have in that directory:

00-All-hg38-chr.vcf  hg38.dict  ref_genome.fa.fai  resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf
00-All-hg38-chr.vcf.idx  hg38.fa  ref_genome.fa.ndb  resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf.amb
00-All-hg38.vcf  hg38.fa.amb  ref_genome.fa.nhr  resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf.ann
00-All-hg38.vcf.idx  hg38.fa.ann  ref_genome.fa.nin  resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf.bwt
00-All-hg38.vcf.tbi  hg38.fa.bwt  ref_genome.fa.not  resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf.pac
00-All.vcf  hg38.fa.fai  ref_genome.fa.nsq  resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf.sa
__chkpts  hg38.fa.pac  ref_genome.fa.ntf
gencode.v38.chr_patch_hapl_scaff.annotation.gtf  hg38.fa.sa  ref_genome.fa.nto
GRCh38_gencode_v44_CTAT_lib_Oct292023.plug-n-play.tar.gz  ref_genome.fa  ref_genome.fa.star.idx

Please help me resolve this issue as soon as possible.
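The error above says AnnotFilterRule.pm was not found in the directory being used as the CTAT genome lib, and the listing does not show it either. The first command in this thread supplied that file through --annot_filter_rule, whereas the command in the previous comment did not. A minimal pre-flight check could look like the sketch below. The helper name is hypothetical, and the required-file list contains only the one file named in the error, since the full set of files STAR-Fusion expects is not stated in this thread.

```python
import tempfile
from pathlib import Path


def missing_files(lib_dir, required=("AnnotFilterRule.pm",)):
    """Return the names from `required` that are not regular files inside lib_dir."""
    lib = Path(lib_dir)
    return [name for name in required if not (lib / name).is_file()]


# Demonstration on a throwaway directory rather than a real genome lib.
with tempfile.TemporaryDirectory() as tmp:
    before = missing_files(tmp)
    (Path(tmp) / "AnnotFilterRule.pm").write_text("# placeholder\n")
    after = missing_files(tmp)

print(before)
print(after)
```

Running the check against the build's output directory before launching STAR-Fusion would surface this problem early. The listing also shows an unextracted plug-n-play archive, so it may be worth checking whether extracting it provides a complete lib.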