Flair collapse requires >1TB of RAM

oneillkza commented 1 year ago

Copy and paste the exact command you tried to run

flair collapse -q merged_alignments.bed             --genome hg38_no_alt.fa             --reads merged.fastq            --gtf Homo_sapiens.GRCh38.100.remapped.gtf          --threads 24        --output "F117211_flair"

How did you install Flair? (We'd prefer it if you used one of the top two because they are the least likely to have package compatibility problems.)

singularity pull docker:quay.io/biocontainers/samtools:1.16.1--h6899075_1

(ie the Biocontainers repackaging of the Bioconda installation).

What happened?

Writing temporary files to /projects/koneill_scratch/nextflow_scratch/tmpqwztmrfm/  
Annotated ends extracted from GTF
Read data extracted
slurmstepd: error: Job 37840129 exceeded memory limit (761779600 > 209715200), being killed
slurmstepd: error: Exceeded job memory limit
slurmstepd: error: *** JOB 37840129 ON n304 CANCELLED AT 2022-11-24T12:30:47 ***

What else do we need to know? The merged corrected bed file was around 6.5GB. This was one single PromethION flowcell's worth of sequence, so not really that much.

I was trying to get a pooled collapsed set of transcripts so as to do quantification across a set of samples.

I was later able to run this on a high-memory server with 1.4TB of RAM. NextFlow tells me it used 1.005 TB of RAM. I can potentially run flair just on the super-high-memory servers, but this does prevent me from running it on our HPC cluster, and seems unreasonable that it needs this much for this little input data.

Jeltje commented 1 year ago

Thanks for the detailed description. We're aware of the issue and are planning to work on it when time and grants permit. Meanwhile the simplest way to reduce memory requirements is by splitting the input bed by chromosome.

Hope that helps!

oneillkza commented 1 year ago

Thanks @Jeltje. That's useful to know.

Unfortunately we're interested in fusion transcripts, so splitting by chromosome doesn't help that much.

I guess we could do something like pull out the reads that span multiple chromosomes (probably by getting a list of read names from the original bam), then split the remaining reads by chromosome.

Jeltje commented 1 year ago

I'm not sure if you can properly collapse fusion reads without adding the expected fusion to your genome. By default Flair only considers the best match of a read, so if its 5' and 3' ends map to different chromosomes only the longest alignment gets used.

I don't think bed-12 format allows for fusion transcripts, so how do you intend to identify them?

oneillkza commented 1 year ago

BED12 is just representing alignments, and there can be multiple alignments per read. Or in the case of collapsed transcripts, multiple alignments per transcript. ie for a fusion transcript there are multiple lines in the BED12 file, all with the same value in the name column.

(At least, when I've been doing this the long way, by de novo assembly of transcripts -> alignment -> BED12, it comes out like that.)

(Also, I guess most fusions would only have two lines, but I've been working on some projects looking at HPV integrations, which tend to include multiple fragmented copies of the HPV genome, hence more than two lines in the BED12.)

OK, that is useful to know that Flair explicitly does not handle fusions.

Jeltje commented 1 year ago

There is a fusion_dist parameter that we were about to remove. It allows a second alignment for reads, I'm not sure if that alignment has to cover a different section (I just had a quick glance and that code is really dense).

Flair collapse does rename the transcripts (unless they do not get merged) so I'm not sure how you can trace the fusion reads.

oneillkza commented 1 year ago

Thanks -- I did see that parameter, but the documentation implies that it's only for same-chromosome fusions.

Anyway, I don't think the issue is with Flair collapse renaming transcripts. The issue is with Flair discarding all but the longest alignment for a transcript. If it could accept the multiple alignments as belonging to the same transcript, then it would be able to handle fusions.

e.g. I've worked a bit with rnabloom, and it gives each transcript a unique name in the output fasta. You can align that to a reference and then trace the fusion transcripts through their name. And that fasta assembly can also be used to align the raw reads against for quantification (e.g. using salmon). The only problem there is that rnabloom doesn't do a reference-guided assembly, so I need to find (or code up) a means of matching the assembled transcripts to a reference.

BrooksLabUCSC / flair

Flair collapse requires >1TB of RAM #230