RobertsLab / resources

https://robertslab.github.io/resources/

Bam file conversion questions #1486

Closed ChrisMantegna closed 2 years ago

ChrisMantegna commented 2 years ago

My BWEEMS friend Anamica (https://github.com/NankaBD) is looking for support with the following problem:

"Is anyone here familiar with PacBio raw reads? I am struggling to understand how to download bam and bam.pbi. A collaborator converted the files to fastq for me, but I am unable to load them into Geneious because my RAM can’t take it"

Thank you!

kubu4 commented 2 years ago

because my RAM can’t take it

Well, to steal the line from Jaws, "You're gonna need a bigger computer."

Getting into bioinformatics requires a large amount of storage space and a fair amount of RAM for many "basic" processes.

With that said, somewhere in Geneious there is a setting that controls how much RAM the program is allowed to use. I haven't used Geneious in ~10 years, so I can't remember where that setting lives. If they can find it, raising it might help for a bit.

And, now comes my reproducibility component...

As nice as Geneious is (it is VERY nice - we used to use it and love it), it would be very beneficial for Anamica to start exploring/using command line tools. Doing so makes the science more open/transparent, which improves people's ability to troubleshoot. Additionally, it makes things easier to reproduce, since open source command line tools are available to everyone.
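As a rough illustration of why scripted tools cope with files that a desktop program cannot load: a script can stream a FASTQ file one record at a time, so memory use stays small no matter how large the file is. The sketch below is only an assumption-laden example, not anything used in this thread. It assumes the common 4-lines-per-record FASTQ layout, and the function names and the file name in the usage note are hypothetical.

```python
import gzip
from typing import Iterator, TextIO, Tuple


def open_text(path: str) -> TextIO:
    """Open a plain or gzip-compressed text file for reading."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) one record at a time.

    Assumes 4 lines per record (sequence and quality are not wrapped).
    """
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # tolerate blank lines between records
            continue
        if not header.startswith("@"):
            raise ValueError(f"unexpected FASTQ header: {header!r}")
        seq = handle.readline().rstrip("\n")
        handle.readline()       # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:].rstrip("\n"), seq, qual


def fastq_summary(path: str) -> dict:
    """Count reads and bases without holding the file in memory."""
    n_reads = 0
    total_bases = 0
    longest = 0
    with open_text(path) as handle:
        for _name, seq, _qual in read_fastq(handle):
            n_reads += 1
            total_bases += len(seq)
            longest = max(longest, len(seq))
    return {"reads": n_reads, "bases": total_bases, "longest": longest}
```

Calling `fastq_summary("reads.fastq.gz")` (hypothetical file name) returns a small dictionary of counts. Only one record is held in memory at a time, which is the main difference from loading the whole file into a GUI.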

ABediSilva commented 2 years ago

Hi @kubu4! Thanks for taking the time to help me. Sadly, you've said everything I have feared. Fortunately, I do have some command line skills and access to a supercomputer (shout out MANA here at the University of Hawai‘i). That being said, I guess I have to move the bam files from my sequencing center's server to my server. What a pain. Do you have any additional advice on mapping PacBio raw reads to reference assemblies?

kubu4 commented 2 years ago

That being said, I guess I have to move the bam files from my sequencing center's server to my server. What a pain.

This is almost always part of the process. Sequencing facilities don't have the storage space to keep everyone's data forever, so retrieving your data always has to happen. However, the process is less painful each time you do it. :)
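One habit that makes transfers less painful is checking that each file arrived intact. A minimal sketch, assuming the sequencing center publishes MD5 checksums alongside the files (that is an assumption; they may use another algorithm or none at all). The function names are hypothetical.

```python
import hashlib


def md5sum(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large BAM files never sit fully in RAM.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_hex: str) -> bool:
    """Compare a file's MD5 with the checksum supplied by the data provider."""
    return md5sum(path) == expected_hex.strip().lower()
```

If `matches_expected` returns `False` for a downloaded file, the usual next step is to transfer that file again before doing any analysis on it.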

the bam files

If you have BAM files, then that suggests the reads have already been aligned (mapped) to your reference. So, they may have already done that step for you! (P.S. I'd ask them for the deets on how they did it - program(s) and version(s), corresponding settings, reference file(s) used, etc).

sr320 commented 2 years ago

I have seen BAMs that are just raw reads (presumably aligned to each other).
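One way to tell the two cases apart is to look at the BAM header: an unaligned BAM lists no reference sequences. The sketch below reads only the header using the Python standard library, relying on the fact that BAM files are BGZF-compressed and that BGZF is readable as gzip. The function names are hypothetical, and an empty reference list is a strong hint rather than proof; checking individual records with a dedicated tool would be more thorough.

```python
import gzip
import struct
from typing import Tuple


def read_bam_header(path: str) -> Tuple[str, int]:
    """Return (header_text, number_of_reference_sequences) from a BAM file."""
    with gzip.open(path, "rb") as handle:
        if handle.read(4) != b"BAM\x01":
            raise ValueError("not a BAM file (bad magic bytes)")
        # l_text: length of the plain-text SAM header, little-endian int32.
        (l_text,) = struct.unpack("<i", handle.read(4))
        text = handle.read(l_text).rstrip(b"\x00").decode("utf-8", "replace")
        # n_ref: number of reference sequences the reads could be aligned to.
        (n_ref,) = struct.unpack("<i", handle.read(4))
    return text, n_ref


def looks_unaligned(path: str) -> bool:
    """True when the header names no reference sequences at all."""
    text, n_ref = read_bam_header(path)
    has_sq = any(line.startswith("@SQ") for line in text.splitlines())
    return n_ref == 0 and not has_sq
```

If `looks_unaligned` returns `True`, the file most likely holds raw reads that still need to be mapped to a reference.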

What is the format of all the raw files provided to you? And what is the end goal, e.g. SNPs, expression, methylation?

ABediSilva commented 2 years ago

We originally used PacBio for SV analysis and, since I had the data, I was hoping to add this info to the SNP analysis on my Illumina-derived genomes. However, I was just informed that these sequences were generated from PacBio CLR, not HiFi as I previously thought. SNP calling does not seem to be worth the bother.

sr320 commented 2 years ago

Curious - what is the difference between CLR and HiFi?

ABediSilva commented 2 years ago

I have a very preliminary understanding, so I included a link to a pdf that made things a bit clearer for me. I am also linking a PacBio glossary in case someone finds this thread and finds the glossary helpful. Continuous Long Read (CLR) PacBio sequencing tends to be more error-prone than Circular Consensus Sequencing (CCS)/HiFi sequencing. I have been told the CLR is pretty crummy for SNP calling, although it can be done. If anyone has experience using CLR-derived reads for SNP calling, then please let me know how it went.

https://www.ndsu.edu/pubweb/~mcclean/plsc411/Pacific%20Biosciencs%20CCS%20vs%20CLR%20modes.pdf

https://www.pacb.com/wp-content/uploads/2015/09/Pacific-Biosciences-Glossary-of-Terms.pdf

kubu4 commented 2 years ago

I have been told the CLR is pretty crummy for SNP calling, although it can be done

Right. This is probably due to the potential for low sequencing depth across all regions of the genome.
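Depth is easy to estimate before committing to SNP calling: average depth is roughly the total number of sequenced bases divided by the genome size. A minimal sketch, with a hypothetical function name and made-up numbers in the usage note; note that a genome-wide average can hide regions with much lower coverage.

```python
from typing import Iterable


def mean_depth(read_lengths: Iterable[int], genome_size: int) -> float:
    """Estimate average sequencing depth as total bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size
```

For example, three reads of 10,000, 15,000 and 5,000 bases over a 10,000-base genome give `mean_depth([10_000, 15_000, 5_000], 10_000) == 3.0`.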

If you don't mind, per @sr320's suggestion, could you provide a basic overview of the project, what you're trying to achieve, and why (i.e. decision process) you decided to follow the path you're on?

I think this will help us get a better idea of what you're trying to accomplish and potentially come up with more ideas on how to use the data that you have.

ABediSilva commented 2 years ago

I am comparing genomes of microbial cell lines that have immunity against viral infection to cell lines that are susceptible to infection. Until recently, mutations associated with immunity were found to be affiliated with SNPs in hypervariable regions of microbial genomes. To assess this, we sequenced the genomes of 22 of our immune and susceptible cell lines via Illumina. Recently, a paper came out with evidence showing that immunity may be achieved through chromosomal rearrangement in a related species' genome. That study used PFGE to assess rearrangements, but we decided to try sequencing with PacBio to find SVs.
The SV analysis was carried out by the bioinformatics team on our grant. The SNP analysis was left up to me. I've been able to find some interesting things using the Illumina data. However, it would be nice to have more genomes in order to corroborate some patterns. I thought I could fold in the PacBio sequences to my SNP analysis but it's looking less and less likely.

kubu4 commented 2 years ago

we sequenced the genomes of 22 of our immune and susceptible cell lines via Illumina

Would you be able to identify SNPs using this data?

ChrisMantegna commented 2 years ago

Just checked in with Anamica and she has everything she needs. Thank you both for all of your help. :) I'm closing the thread.