Hi, I have a large RNA-seq sample (.fastq.gz, ~84G; .bam, ~61G). After running `flair correct`, I got the corrected.bed file (3.6G). Because it is larger than 1G, I split the bed file by chromosome and then ran `flair collapse` on each chromosome separately. However, the resulting gtf file looks odd. For example, I ran `flair collapse` on chr1 and got chr1.gtf, but it contains transcripts on other chromosomes besides chr1.

Below is my code to correct the reads:


# 1. bam2bed
bam2Bed12 -i $input >$output/$sample/preprocess/$sample.bed
# 2. correct
flair correct --threads 10 -q $output/$sample/preprocess/$sample.bed \
    -f $anno -g $ref --output $output/$sample/preprocess/$sample

# 3. split bed by chromosome

awk -F"\t" '$1!~"chr" || $1=="chrM"' $output/$sample/preprocess/${sample}_all_corrected.bed >$output/$sample/preprocess/patches.bed

for i in {1..22} X Y
do
    chr=chr$i
    awk -F"\t" '$1=="'$chr'"' $output/$sample/preprocess/${sample}_all_corrected.bed >$output/$sample/preprocess/$chr.bed
done
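The loop above reads the 3.6G bed file once per chromosome. A possible alternative is to split it in a single pass. This is only a minimal sketch that tries to mirror the filters above: `split_bed_by_chrom` is a hypothetical helper name (not part of FLAIR), and contigs such as `chr1_..._random` are dropped here just as they are by the loop.

```shell
# Sketch: split a BED file by chromosome in one pass over the input.
# split_bed_by_chrom is a hypothetical helper, not a FLAIR command.
split_bed_by_chrom() {
    bed=$1
    outdir=$2
    mkdir -p "$outdir"
    awk -F'\t' -v dir="$outdir" '
        # chr1..chr22, chrX, chrY: one file per chromosome
        $1 ~ /^chr([1-9]|1[0-9]|2[0-2]|X|Y)$/ {
            out = dir "/" $1 ".bed"
            print > out
            next
        }
        # names without "chr", plus chrM, go to patches.bed (same rule as above)
        $1 !~ /chr/ || $1 == "chrM" {
            out = dir "/patches.bed"
            print > out
        }
    ' "$bed"
}
```

It would be called as `split_bed_by_chrom $output/$sample/preprocess/${sample}_all_corrected.bed $output/$sample/preprocess`.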


- Below is my code to run flair collapse:

flair collapse -t 15 \
    -q $output/$sample/preprocess/$chr.bed \
    -r $root/raw.reads/$sample.fastq.gz -f $anno -g $ref \
    --stringent --check_splice --annotation_reliant generate \
    -o $output/$sample/$chr


- After `flair collapse` finished, I checked the gtf file for chr1, and it contains many transcripts on chromosomes other than chr1:

cut -f1 chr1.isoforms.gtf |sort | uniq -c

62102 chr1
345 chr10
690 chr11
484 chr12
158 chr13
423 chr14
343 chr15
409 chr16
586 chr17
125 chr18
618 chr19
626 chr2
326 chr20
149 chr21
225 chr22
567 chr3
306 chr4
429 chr5
459 chr6
486 chr7
336 chr8
364 chr9
37 chrM
379 chrX
11 chrY



- Should the `raw fastq` file passed to `flair collapse` be consistent with the `bed` file? That is, after splitting, should I extract only the reads that appear in each per-chromosome `bed` file?
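If subsetting is the right approach, this is roughly what I have in mind. It is only a sketch under two assumptions that I have not verified: column 4 of the corrected bed file holds the read ID exactly as it appears in the fastq header (up to the first whitespace), and every fastq record is exactly four lines. `subset_fastq_by_bed` is a hypothetical helper name, not a FLAIR command.

```shell
# Sketch: keep only the FASTQ records whose read ID appears in column 4 of a BED file.
# subset_fastq_by_bed is a hypothetical helper, not part of FLAIR.
subset_fastq_by_bed() {
    bed=$1
    fastq_gz=$2
    out_gz=$3
    gzip -dc "$fastq_gz" | awk -F'\t' '
        # first input (the BED file): remember every read name in column 4
        NR == FNR { keep[$4] = 1; next }
        # second input (FASTQ on stdin): line 1 of each 4-line record is the header
        FNR % 4 == 1 {
            id = substr($0, 2)        # drop the leading "@"
            sub(/[ \t].*/, "", id)    # drop any description after the read ID
            wanted = (id in keep)
        }
        # print all four lines of a record when its ID was found in the BED file
        wanted
    ' "$bed" - | gzip -c > "$out_gz"
}
```

For chr1 this would be `subset_fastq_by_bed $output/$sample/preprocess/chr1.bed $root/raw.reads/$sample.fastq.gz chr1.fastq.gz`, with the resulting file passed to `-r` in place of the full fastq.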

BrooksLabUCSC / flair

Odd result after splitting bed by chromosomes #345
