BrooksLabUCSC / flair

Full-Length Alternative Isoform analysis of RNA
Other
207 stars 71 forks source link

Flair collapse issue #327

Closed TTT16 closed 6 months ago

TTT16 commented 7 months ago

Copy and paste the exact command you tried to run

SBATCH --mail-type=END,FAIL

SBATCH --mail-user=ttat@houstonmethodist.org

SBATCH --error=/condo/kisslab/tmhtxt53/GridIon/SLURM_OUT/%j.err

SBATCH --output=/condo/kisslab/tmhtxt53/GridIon/SLURM_OUT/%j.out

#######################################################
pwd; hostname; date
# module load mamba

conda activate

module load flair module list # sample_file="/condo/kisslab/tmhtxt53/GridIon/ready_analyzed_samples/sample_names.txt" parent_dir="/condo/kisslab/tmhtxt53/GridIon/ready_analyzed_samples" input_dir="/condo/kisslab/tmhtxt53/GridIon/ready_analyzed_samples/curated-10Samples_fastq" ref="/scratch2/ref/GCA_000001405.15_GRCh38_noALT" ############################################## # cd $parent_dir pwd=$(pwd) mkdir -p flair4_quantify # while IFS= read -r sample; do # echo sample is $sample cd "$sample" || { echo "Error: Unable to enter the folder $sample_name"; exit 1; } echo we are in sample folder pwd # sample_pwd=$(pwd) # folder_name="2023*" full_folder_name=$(find . -maxdepth 1 -type d -name "$folder_name" -print -quit) # inner_folder=$(echo $full_folder_name | awk -F '[/]' '{print $2}') echo "inner folder is: $inner_folder" # cd $sample_pwd/$inner_folder echo we are inside sample and inner folder pwd

      mkdir -p flair4
      align=1
      correct=2
      collapse=3
            flair 123 \
            -g $ref/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
            -t 1024 \
            -r $input_dir/$sample-gridion-final_pass.fastq.gz \
            -f $ref/GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf \
            -m \
            -nvrna \
            --generate_map \
            --annotated_bed \
            -o flair4/$sample-flair.output \
            --temp_dir flair4/temp_dir \
            --keep_intermediate
    #       
    cp *.output.isoforms.bed $pwd/flair4_quantify/
cd $pwd

done < "$sample_file" # ENDTIME=$(date +%s) echo "Final End Time is $(date)" echo "It took ($ENDTIME-$STARTTIME) secs" exit 0

What happened?

Hi, I tried to run flair as the code above, and I DID NOT get the output files "...output.isoform.bed", "...output.isoform.fa", and "...output.isoform.gtf", instead of that, we got "...output_cannot_verify.bed"

[tmhtxt53@hmrihpcp03 flair4]$ ls TotalRNA-HEKWT_0h_FAL06539_A-flair.output_all_corrected.bed TotalRNA-HEKWT_0h_FAL06539_A-flair.output_all_inconsistent.bed TotalRNA-HEKWT_0h_FAL06539_A-flair.output.bam TotalRNA-HEKWT_0h_FAL06539_A-flair.output.bam.bai TotalRNA-HEKWT_0h_FAL06539_A-flair.output.bed TotalRNA-HEKWT_0h_FAL06539_A-flair.output_cannot_verify.bed

What is wrong with my code? Thank you, Trinh Please replace this text with the whole error message


We know it's ugly but we promise it helps us solve problems faster.

**What else do we need to know?**
Go ahead, every bit helps.
Jeltje commented 7 months ago

This usually happens when the chromosome names in your genome file do not match the ones in the annotation gtf. This seems unlikely in your case (judging by the names they were retrieved from the same source), but please verify by making sure that GCA_000001405.15_GRCh38_no_alt_analysis_set.fna has a similar name structure (chr1 or just 1) as GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf (first field).

TTT16 commented 7 months ago

They have the same chr1

(/cm/shared/apps/flair/2.0.0) [tmhtxt53@hmrihpcp03 GCA_000001405.15_GRCh38_noALT]$ head GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf

gtf-version 2.2

!genome-build GRCh38.p14

!genome-build-accession NCBI_Assembly:GCA_000001405.15

!annotation-date 10/02/2023

!annotation-source NCBI RefSeq GCF_000001405.40-RS_2023_10

chr1 BestRefSeq gene 11874 14409 . + . gene_id "DDX11L1"; transcript_id ""; db_xref "GeneID:100287102"; db_xref "HGNC:HGNC:37102"; description "DEAD/H-box helicase 11 like 1 (pseudogene)"; gbkey "Gene"; gene "DDX11L1"; gene_biotype "transcribed_pseudogene"; pseudo "true"; chr1 BestRefSeq transcript 11874 14409 . + . gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; gbkey "misc_RNA"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1 (pseudogene)"; pseudo "true"; transcript_biotype "transcript"; chr1 BestRefSeq exon 11874 12227 . + . gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1 (pseudogene)"; pseudo "true"; transcript_biotype "transcript"; exon_number "1"; chr1 BestRefSeq exon 12613 12721 . + . gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1 (pseudogene)"; pseudo "true"; transcript_biotype "transcript"; exon_number "2"; chr1 BestRefSeq exon 13221 14409 . + . gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1 (pseudogene)"; pseudo "true"; transcript_biotype "transcript"; exon_number "3";

(/cm/shared/apps/flair/2.0.0) [tmhtxt53@hmrihpcp03 GCA_000001405.15_GRCh38_noALT]$ head GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRCh38 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

TTT16 commented 7 months ago

hi, they have the same "chr1", so the issue(s) must come from something else, right?

(/cm/shared/apps/flair/2.0.0) @.*** GCA_000001405.15_GRCh38_noALT]$ head GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf

gtf-version 2.2

!genome-build GRCh38.p14

!genome-build-accession NCBI_Assembly:GCA_000001405.15

!annotation-date 10/02/2023

!annotation-source NCBI RefSeq GCF_000001405.40-RS_2023_10

chr1 BestRefSeq gene 11874 14409 . + . gene_id "DDX11L1"; transcript_id ""; db_xref "GeneID:100287102"; db_xref "HGNC:HGNC:37102"; description "DEAD/H-box helicase 11 like 1 (pseudogene)"; gbkey "Gene"; gene "DDX11L1"; gene_biotype "transcribed_pseudogene"; pseudo "true";

chr1 BestRefSeq transcript 11874 14409 . + . gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; gbkey "misc_RNA"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1 (pseudogene)"; pseudo "true"; transcript_biotype "transcript";

chr1 BestRefSeq exon 11874 12227 . + . gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1 (pseudogene)"; pseudo "true"; transcript_biotype "transcript"; exon_number "1";

chr1 BestRefSeq exon 12613 12721 . + . gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1 (pseudogene)"; pseudo "true"; transcript_biotype "transcript"; exon_number "2";

chr1 BestRefSeq exon 13221 14409 . + . gene_id "DDX11L1"; transcript_id "NR_046018.2"; db_xref "GeneID:100287102"; gene "DDX11L1"; product "DEAD/H-box helicase 11 like 1 (pseudogene)"; pseudo "true"; transcript_biotype "transcript"; exon_number "3";

(/cm/shared/apps/flair/2.0.0) @.*** GCA_000001405.15_GRCh38_noALT]$ head GCA_000001405.15_GRCh38_no_alt_analysis_set.fna

chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRCh38

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

On Tue, Feb 20, 2024 at 1:31 PM Jeltje @.***> wrote:

This usually happens when the chromosome names in your genome file do not match the ones in the annotation gtf. This seems unlikely in your case (judging by the names they were retrieved from the same source), but please verify by making sure that GCA_000001405.15_GRCh38_no_alt_analysis_set.fna has a similar name structure (chr1 or just 1) as GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf (first field).

— Reply to this email directly, view it on GitHub https://github.com/BrooksLabUCSC/flair/issues/327#issuecomment-1954918805, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALAPWNWBA32G6FFCUEEIGILYUT2ZJAVCNFSM6AAAAABDRN7UJSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNJUHEYTQOBQGU . You are receiving this because you authored the thread.Message ID: @.***>

Jeltje commented 7 months ago

Thanks for checking that. What are the first few lines of cannot_verify.bed?

TTT16 commented 7 months ago

Here is the content of cannot_verify.bed. What should I do next? Thanks!

@.*** flair4]$ head TotalRNA-HEKWT_0h_FAL06539_A-flair.output_cannot_verify.bed

chr11_KI270721v1_random 6092 11780 5e46e250-5d5f-4e18-a158-9733aa379d3c 1 + 6092 11780 27,158,119 4 124,83,74,330, 0,1229,1884,5358,

chr11_KI270721v1_random 6097 11795 b36c537e-3eb0-4244-9d23-e7faf23bde40 1 + 6097 11795 27,158,119 4 119,83,78,345, 0,1224,1879,5353,

chr11_KI270721v1_random 6139 11792 ffc52e08-b645-4fb8-a370-2401e33b9c79 1 + 6139 11792 27,158,119 5 77,83,74,171,342, 0,1182,1837,4666,5311,

chr22_KI270734v1_random 131503 137382 1167e0e4-1218-4811-b90f-fc67f643abba 8

chr22_KI270734v1_random 131600 137395 fde8dff6-0163-4202-9e6c-783c3c14c4f1 11 + 131600 137395 27,158,119 5 155,161,101,141,551, 0,235,3842,4558,5244,

chr22_KI270734v1_random 131603 137393 a1166575-39c9-4228-b27a-1357e8cb378c 1

chr22_KI270734v1_random 131612 137386 2c02365c-972f-46db-b3d1-63b0bec43592 12 + 131612 137386 27,158,119 5 143,161,101,141,542, 0,223,3830,4546,5232,

chr22_KI270734v1_random 131912 137386 7b6f44ec-4da7-4a32-ac8f-619968bb8597 8

chr22_KI270734v1_random 135513 137393 fb3bf38e-48a4-47b2-9097-8aac6957c83c 3

chr22_KI270734v1_random 136156 137387 4783b28a-77e9-4116-9d78-cb9a3be138e2 1

On Wed, Feb 21, 2024 at 10:14 AM Jeltje @.***> wrote:

Thanks for checking that. What are the first few lines of cannot_verify.bed?

— Reply to this email directly, view it on GitHub https://github.com/BrooksLabUCSC/flair/issues/327#issuecomment-1957151571, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALAPWNV7656EMAX7CDH45S3YUYMNVAVCNFSM6AAAAABDRN7UJSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNJXGE2TCNJXGE . You are receiving this because you authored the thread.Message ID: @.***>

Jeltje commented 7 months ago

There's something odd about your input arguments: -m is no longer an option in Flair 2.0, and when it was in previous versions it needed an argument. --nvrna needs 2 dashes or it will be ignored --annotated_bed needs an argument or the program will die in the collapse step. You don't actually need --annotated_bed because you're already giving it a gtf input and Flair can use that instead.

Are you using Flair 2.0?

TTT16 commented 7 months ago

So, how my command would be? Could you please suggest?

And I don't know if it is FLAIR 1 or 2.0 since the systems architect in my institution installed it. How can we check the version of flair on HPC? And if possible, could you please send me the github link to FLAIR 2?

Thank you very much for your help, Best, Trinh

On Wed, Feb 21, 2024 at 11:02 AM Jeltje @.***> wrote:

There's something odd about your input arguments: -m is no longer an option in Flair 2.0, and when it was in previous versions it needed an argument. --nvrna needs 2 dashes or it will be ignored --annotated_bed needs an argument or the program will die in the collapse step. You don't actually need --annotated_bed because you're already giving it a gtf input and Flair can use that instead.

Are you using Flair 2.0?

— Reply to this email directly, view it on GitHub https://github.com/BrooksLabUCSC/flair/issues/327#issuecomment-1957315199, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALAPWNWKNENATIYWBZGCPCDYUYSA5AVCNFSM6AAAAABDRN7UJSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNJXGMYTKMJZHE . You are receiving this because you authored the thread.Message ID: @.***>

TTT16 commented 7 months ago

We checked and the version of flair on our HPC is FLAIR v2.0.0. Could you please suggest the correct command, and what are the human genome.fa and annotation.gtf files we should use for hg38? I have 10 samples including 4 conditions (2 reps, 3 reps, 2 reps, 3 reps), should I concatenate the output files before or after running flair collapse? And to run flair quantify, for the input file (isoform.fa), do I need to concatenate all 10 isoform.fa from 10 samples, or just use 1 sample only?
Thank you very much for your help, Best, Trinh

Jeltje commented 7 months ago

I think I see the problem. You're running Flair sample by sample and that's not what you want.

It's a lot easier if you separate the steps, something like this (please check this, don't just copy):

            flair align \
            -g $ref/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
            -t 30 \
            -r $input_dir/$sample1.fq.gz,$input_dir/$sample2.fq.gz,$input_dir/$sample3.fq.gz \
            --nvrna \
            -o flairout/aligned

            flair correct \
            -g $ref/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
            -t 30 \
            --query flairout/aligned.bed  \
            -f $ref/GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf \
            --nvrna \
            -o flairout/correct

Please note there's no need to give it 1024 CPUs, it's not going to be faster. You will then end up with a large output that needs to be split before submitting to collapse. I'm just working on that, so I'll post again shortly.

TTT16 commented 7 months ago

Hi, I tried to run separate steps of flair, and got these output files. I am not quite sure which one(s) will be helpful or contain what information. Which files I should use for next step? Could you please help? Thank you!

[tmhtxt53@hmrihpcp03 flair6_SR]$ ls chrom_corrected flair6.collapse.annotated_transcripts.isoform.read.map.txt flair6.collapse.isoforms.gtf flair6.aligned.bam flair6.collapse.annotated_transcripts.supported.bed flair6.collapse.unassigned.bed flair6.aligned.bam.bai flair6.collapse.combined.isoform.read.map.txt flair6.collapse.unassigned.fasta flair6.aligned.bed flair6.collapse.firstpass.bed flair6.corrected_all_corrected.bed flair6.collapse.annotated_transcripts.alignment.counts flair6.collapse.firstpass.q.counts flair6.corrected_all_inconsistent.bed flair6.collapse.annotated_transcripts.alignment.mm2_stderr flair6.collapse.firstpass.unfiltered.bed flair6.corrected_cannot_verify.bed flair6.collapse.annotated_transcripts.alignment.sam flair6.collapse.isoform.read.map.txt temp_dir flair6.collapse.annotated_transcripts.bed flair6.collapse.isoforms.bed flair6.collapse.annotated_transcripts.fa flair6.collapse.isoforms.fa

Jeltje commented 7 months ago

It looks like you ran all three steps (align, correct, and collapse), so you should be finished. Your results should be flair6.collapse.isoforms.bed, and flair6.collapse.isoforms.fa, the files you couldn't find after your original run.

TTT16 commented 7 months ago

Hi, Here are all the output files I got:

chrom_corrected flair6.collapse.annotated_transcripts.bed flair6.collapse.firstpass.unfiltered.bed flair6.corrected_all_corrected.bed

flair6.aligned.bam flair6.collapse.annotated_transcripts.fa flair6.collapse.isoform.read.map.txt flair6.corrected_all_inconsistent.bed

flair6.aligned.bam.bai flair6.collapse.annotated_transcripts.isoform.read.map.txt flair6.collapse.isoforms.bed flair6.corrected_cannot_verify.bed

flair6.aligned.bed flair6.collapse.annotated_transcripts.supported.bed flair6.collapse.isoforms.fa flair_diffExp

flair6.collapse.annotated_transcripts.alignment.counts flair6.collapse.combined.isoform.read.map.txt flair6.collapse.isoforms.gtf flair.quantify.counts.tsv

flair6.collapse.annotated_transcripts.alignment.mm2_stderr flair6.collapse.firstpass.bed flair6.collapse.unassigned.bed temp_dir

flair6.collapse.annotated_transcripts.alignment.sam flair6.collapse.firstpass.q.counts flair6.collapse.unassigned.fasta

I am not sure which files I should use for next steps (flair_diffExp), could you please help?

And, I also could not find where to download the flair_diffExp. Could you please send me the github link to download it?

Thank you very much for your help.

Best,

Trinh

On Mon, Mar 4, 2024 at 2:16 PM Jeltje @.***> wrote:

It looks like you ran all three steps (align, correct, and collapse), so you should be finished. Your results should be flair6.collapse.isoforms.bed, and flair6.collapse.isoforms.fa, the files you couldn't find after your original run.

— Reply to this email directly, view it on GitHub https://github.com/BrooksLabUCSC/flair/issues/327#issuecomment-1977380444, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALAPWNQWBZIB4HJXAUT4MHLYWTJANAVCNFSM6AAAAABDRN7UJSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNZXGM4DANBUGQ . You are receiving this because you authored the thread.Message ID: @.***>

Jeltje commented 6 months ago

Please follow the instructions at https://flair.readthedocs.io/en/latest/