UofTCoders / studyGroup

Welcome to the University of Toronto Coders!
https://uoftcoders.github.io/

Bioinformatics topic ideas #7

Closed mbonsma closed 8 years ago

mbonsma commented 9 years ago

Suggestions and ideas for topics we'd like to see covered at some point. This thread is for bioinformatics-related topics, or things that people working with genetic data might find useful. Posting a suggestion does not lock you in to presenting on that topic.

ricardoharripaul commented 9 years ago

- Variant calling
- Machine learning with genetic data
- Building biological databases
- Advanced statistical tools would be good for everyone

mbonsma commented 9 years ago

SeqIO in BioPerl or BioPython. My vote's for Python, but maybe it could be taught in a way that's applicable to both if people are interested in both? Edit: here's a possible lesson plan.

linamnt commented 9 years ago

A bit more specific to people who do network analysis, but tools such as igraph in R or NetworkX in Python might be useful for some?

ricardoharripaul commented 9 years ago

Yeah, R also has a package for SeqIO.

I wrote some code in Python that can work with BAM files and FASTQ/FASTA files. I used it to extract and filter certain types of reads. You can also use it for motif finding.
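As a rough illustration of this kind of read filtering (not the code described above, which is not shown in this thread), here is a minimal standard-library sketch that parses four-line FASTQ records and keeps reads containing a motif. The function names, the example reads, and the TATA-like regular expression are all assumptions made for the example; a real project would more likely use Biopython's SeqIO or pysam.

```python
import io
import re
from typing import Iterator, TextIO, Tuple

FastqRecord = Tuple[str, str, str]  # (read id, sequence, quality string)


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a simple four-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual


def reads_with_motif(handle: TextIO, motif_regex: str,
                     min_length: int = 0) -> Iterator[FastqRecord]:
    """Keep reads that are long enough and contain the motif."""
    pattern = re.compile(motif_regex)
    for read_id, seq, qual in read_fastq(handle):
        if len(seq) >= min_length and pattern.search(seq):
            yield read_id, seq, qual


# Hypothetical example data, made up for illustration.
EXAMPLE_FASTQ = """@read1
ACGTTATAAAGG
+
IIIIIIIIIIII
@read2
GGGGCCCC
+
IIIIIIII
@read3
TTTATATAGC
+
IIIIIIIIII
"""

hits = list(reads_with_motif(io.StringIO(EXAMPLE_FASTQ), r"TATA[AT]A"))
hit_ids = [read_id for read_id, _, _ in hits]
print(hit_ids)  # ['read1', 'read3']
```

The same generator pattern extends to other filters (minimum mean quality, adapter presence) without loading the whole file into memory.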

On Mon, Jun 15, 2015 at 2:56 PM, mbonsma notifications@github.com wrote:

SeqIO http://biopython.org/wiki/SeqIO in BioPerl or BioPython. My vote's for Python, but maybe it could be taught in a way that's applicable to both if people are interested in both?

— Reply to this email directly or view it on GitHub https://github.com/mbonsma/studyGroup/issues/7#issuecomment-112173166.

mbonsma commented 9 years ago

This is so great. Ideas! If we even had a session where everyone who works with fasta files just talks about their life, I would be so happy. Haha.

MattStata commented 9 years ago

I've been using NetworkX in Python for a while now, to do a variety of things, particularly focused around the use of the MCL algorithm for network clustering. I'd love to learn more about visualizing networks using NetworkX + matplotlib if anyone has any expertise in that? Or just matplotlib in general, really.
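For readers who have not met MCL (Markov clustering) before, the core loop is short enough to sketch in plain Python: alternate "expansion" (squaring a column-stochastic matrix) with "inflation" (raising entries to a power and re-normalizing) until the matrix stops changing. This is a toy, dense-matrix sketch with assumed parameter defaults and an assumed example graph, not the reference `mcl` program and not anyone's pipeline from this thread.

```python
from typing import List

Matrix = List[List[float]]


def normalize_columns(matrix: Matrix) -> Matrix:
    """Scale each column so that it sums to 1 (column-stochastic)."""
    n = len(matrix)
    result = [row[:] for row in matrix]
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                result[i][j] = matrix[i][j] / total
    return result


def mcl(adjacency: Matrix, inflation: float = 2.0,
        max_iterations: int = 100, tolerance: float = 1e-9) -> List[List[int]]:
    """Toy Markov clustering on a small, non-empty adjacency matrix."""
    n = len(adjacency)
    # Add self-loops, then make the matrix column-stochastic.
    flow = normalize_columns(
        [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(max_iterations):
        # Expansion: matrix squaring spreads flow along longer paths.
        expanded = [[sum(flow[i][k] * flow[k][j] for k in range(n))
                     for j in range(n)] for i in range(n)]
        # Inflation: entrywise power strengthens strong flows, weakens weak ones.
        inflated = normalize_columns(
            [[value ** inflation for value in row] for row in expanded]
        )
        change = max(abs(inflated[i][j] - flow[i][j])
                     for i in range(n) for j in range(n))
        flow = inflated
        if change < tolerance:
            break
    # Each non-empty row belongs to an attractor; its non-zero columns are a cluster.
    clusters = {frozenset(j for j in range(n) if flow[i][j] > 1e-6)
                for i in range(n)}
    return sorted(sorted(cluster) for cluster in clusters if cluster)


# Assumed example: two triangles (0-1-2 and 3-4-5) joined by one edge (2-3).
EDGES = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
bridged = [[0.0] * 6 for _ in range(6)]
for a, b in EDGES:
    bridged[a][b] = bridged[b][a] = 1.0

bridged_clusters = mcl(bridged)
print(bridged_clusters)
```

Raising `inflation` generally produces more, smaller clusters; lowering it produces fewer, larger ones.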

MattStata commented 9 years ago

Also, I could present on de novo transcriptome assembly or phylogenomics (the two big areas I've been devoting time to lately), as well as general stuff like BLAST and variant tools, sequence alignment, building gene trees, etc, if there's interest in beginner type stuff.

ricardoharripaul commented 9 years ago

I have experience in matplotlib. It is quite nice to use a script for generating figures because you can change things systematically and easily for different journals.

ricardoharripaul commented 9 years ago

As well,

I can do variant calling, bisulfite alignment, and reference guided alignment. I seem to be doing that a lot lately.

MattStata commented 9 years ago

Oh great! We should chat. I could use some recommendations for variant calling, but in a highly specific context.

ricardoharripaul commented 9 years ago

Sure.

On Wed, Jun 17, 2015 at 10:58 PM, MattStata notifications@github.com wrote:

QuLogic commented 9 years ago

@MattStata @ricardoharripaul #2 ...

ricardoharripaul commented 9 years ago

Hi Matt,

Did you have any questions? I am not sure how you want to communicate.

On Wed, Jun 17, 2015 at 11:06 PM, Elliott Sales de Andrade < notifications@github.com> wrote:

MattStata commented 9 years ago

Well basically, I'm looking for a program that can use RNA-seq reads against a set of coding sequences I've assembled and identify putative SNPs. Do you have any recommendations? I've never used any SNP-calling software before so I'm not really sure what's out there and what the required inputs for most are -- I would assume the majority map genomic reads against a genome, rather than RNA-seq against CDS.

As for communicating, I think here is probably fine, as long as this doesn't turn into a really lengthy side-discussion and totally derail the thread.

ricardoharripaul commented 9 years ago

Hi Matt,

So you already know your SNPs and have their sequence or position? What are you mapping against? There is no reference and nothing similar?

It makes a difference. Your problem reminds me of a targeted sequencing problem.

Ricardo

On Thu, Jun 18, 2015 at 9:55 PM, Matt Stata notifications@github.com wrote:

MattStata commented 9 years ago

I have no reference. I have de novo transcriptome assemblies for two plant species, from which I've extracted orthologous pairs of coding sequences. I would like to now use the original reads from three different individuals to get some idea of the genetic diversity and in particular the degree of heterozygosity, in the interest of deciding whether we need to self the plants several times to reduce heterozygosity before starting a genome sequencing project. I could write something to do this using BLAT results for the read mapping or something, but if there is an existing tool, that would save me some trouble. I imagine there must be something that is either intended for this or flexible enough to use in this situation?
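To make the idea concrete, here is a deliberately naive sketch of flagging heterozygous positions once reads from one individual have been mapped to the coding sequences. It assumes the mapping has already been reduced to a string of observed bases per position; that `pileup` dictionary, the thresholds, and the sequence names are all hypothetical. It ignores base qualities, mapping errors, and allele-specific expression (which can skew RNA-seq allele ratios), so it is not a substitute for an established variant caller.

```python
from collections import Counter
from typing import Dict, Tuple


def classify_site(bases: str, min_depth: int = 8,
                  min_minor_fraction: float = 0.2) -> str:
    """Classify one position from the read bases that cover it."""
    counts = Counter(base for base in bases.upper() if base in "ACGT")
    depth = sum(counts.values())
    if depth < min_depth:
        return "low_coverage"
    top_two = counts.most_common(2)
    # Call the site heterozygous if a second allele is well supported.
    if len(top_two) == 2 and top_two[1][1] / depth >= min_minor_fraction:
        return "heterozygous"
    return "homozygous"


def heterozygous_fraction(pileup: Dict[Tuple[str, int], str]) -> float:
    """Fraction of adequately covered sites that look heterozygous."""
    calls = [classify_site(bases) for bases in pileup.values()]
    callable_sites = [call for call in calls if call != "low_coverage"]
    if not callable_sites:
        return 0.0
    return callable_sites.count("heterozygous") / len(callable_sites)


# Hypothetical per-position read bases for one individual.
pileup = {
    ("cds_0001", 10): "AAAAAAAAAA",  # one allele only
    ("cds_0001", 42): "AAAAAGGGGG",  # two well-supported alleles
    ("cds_0001", 77): "CCCCCCCCCT",  # a single T, likely a sequencing error
    ("cds_0001", 90): "ACG",         # too few reads to call
}

fraction = heterozygous_fraction(pileup)
print(f"{fraction:.3f}")  # 0.333
```

Comparing this fraction across the three individuals would give a first, rough impression of how much heterozygosity remains.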

ricardoharripaul commented 9 years ago

Hi Matt,

I am pretty sure BLAT would be too slow. Did you use Trinity or ABySS for de novo assembly? I am wondering if you could use something like Bowtie, presenting the scaffolds from the de novo assembly as the reference, and map the reads that way.

Have you looked into PAGAN?

https://code.google.com/p/pagan-msa/wiki/PAGAN?tm=6

On Sat, Jun 20, 2015 at 11:02 AM, Matt Stata notifications@github.com wrote:

MattStata commented 9 years ago

BLAT actually works quite well for mapping reads, and can be really fast with the right settings and when run in parallel with GNU parallel. I've used it quite a bit for that.

My pipeline, which I'm still refining, is something like this:

- Combine multiple assemblies (Trinity, IDBA, SOAPdenovo-Trans).
- Predict ORFs with EMBOSS "getorf" and take the three longest per assembled transcript, above a certain minimum length threshold, using a Python script (this of course introduces a lot of spurious ORFs, but they're filtered out over the next steps).
- Remove duplicate ORFs with another Python script.
- Merge highly similar ORFs and take a single representative using CD-HIT in order to reduce redundancy.
- BLAST the ORFs for my two species against each other and take reciprocal best hits to further reduce redundancy.
- BLAST what remains against selected genomes in Phytozome v10.2, and eliminate anything without a good match, to remove totally spurious ORFs that might remain.
- Cluster the results of the two BLAST comparisons using MCL in order to group my assemblies with their orthologs in other species for functional annotation.
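The reciprocal-best-hits step in the list above can be sketched as follows. This assumes both BLAST runs were saved in the standard 12-column tabular format (`-outfmt 6`, with the query in column 1, the subject in column 2, and the bit score in column 12) and that "best" means highest bit score; the actual scripts in this pipeline are not shown in the thread, and the ORF names and the `blast_row` helper are made up for the example.

```python
from typing import Dict, Iterable, List, Tuple


def best_hits(blast_tab_lines: Iterable[str]) -> Dict[str, str]:
    """Map each query to its highest-bit-score subject."""
    best: Dict[str, Tuple[str, float]] = {}
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[str],
                         b_vs_a: Iterable[str]) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


def blast_row(query: str, subject: str, bitscore: float) -> str:
    """Hypothetical helper: build one tabular line with placeholder columns."""
    columns = [query, subject, "95.0", "300", "15", "0",
               "1", "300", "1", "300", "1e-100", str(bitscore)]
    return "\t".join(columns)


# Made-up hits between ORFs of species A and species B.
A_VS_B = [
    blast_row("spA_orf1", "spB_orf7", 540),
    blast_row("spA_orf1", "spB_orf9", 200),
    blast_row("spA_orf2", "spB_orf9", 410),
]
B_VS_A = [
    blast_row("spB_orf7", "spA_orf1", 538),
    blast_row("spB_orf9", "spA_orf1", 205),
]

pairs = reciprocal_best_hits(A_VS_B, B_VS_A)
print(pairs)  # [('spA_orf1', 'spB_orf7')]
```

Here `spA_orf2` is dropped because its best hit, `spB_orf9`, points back to a different ORF.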

Basically this is all aimed at getting a good set of pairwise orthologs with predicted function, so that I can then do further downstream analysis with interspecific hybrids of these two species. But as I mentioned, we also intend to eventually sequence the genomes of the two parent species and so would like to get a rough idea of the degree of heterozygosity so as to decide whether we need to self a few more times before starting sequencing. So I'd like to see what SNPs exist in my coding regions (particularly percentage of SNPs at synonymous sites, which should be a reasonable approximation of SNPs for the other neutral parts of the genome) for each species.
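Once candidate SNPs are in hand, deciding which ones are synonymous is mostly a codon-table lookup. The sketch below assumes the standard genetic code, an in-frame coding sequence starting at position 0, and 0-based SNP coordinates; the example sequence and SNP list are invented for illustration.

```python
from itertools import product
from typing import Dict

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE: Dict[str, str] = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(cds: str, position: int, alt_base: str) -> str:
    """Label a single-base change in an in-frame CDS (0-based position)."""
    cds = cds.upper()
    codon_start = position - position % 3
    ref_codon = cds[codon_start:codon_start + 3]
    if len(ref_codon) != 3:
        raise ValueError("position falls in an incomplete codon")
    offset = position - codon_start
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    if CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]:
        return "synonymous"
    return "nonsynonymous"


# Invented example: ATG GCT AAA encodes Met-Ala-Lys.
cds = "ATGGCTAAA"
snps = [(5, "C"), (3, "A"), (8, "G")]  # (0-based position, alternate base)

labels = [classify_snp(cds, position, alt) for position, alt in snps]
synonymous_fraction = labels.count("synonymous") / len(labels)
print(labels)  # ['synonymous', 'nonsynonymous', 'synonymous']
```

Two of the three example SNPs leave the amino acid unchanged, so `synonymous_fraction` is 2/3 here.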

PAGAN sounds interesting, but I don't see how it fits my problem -- were you suggesting it just as an alternative to BLAT?

ricardoharripaul commented 9 years ago

I was suggesting PAGAN instead of BLAT.

Have you seen this paper? It implements a statistical approach to estimate heterozygosity.

http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.0020041

Is it possible to get frequencies from your data on SNPs?

https://books.google.ca/books?id=hn9ivqTpSKAC&pg=PA371&lpg=PA371&dq=computationally+find+heterozygosity&source=bl&ots=iVNaa3HwUL&sig=e0jy5i7GY9v7azbFoJee_uwa8Nw&hl=en&sa=X&ei=jOSGVd-1AYiUyQS355TIBg&ved=0CEUQ6AEwBQ#v=onepage&q=computationally%20find%20heterozygosity&f=false

Ricardo

On Sat, Jun 20, 2015 at 2:48 PM, Matt Stata notifications@github.com wrote:
