Oshlack / necklace

Combine reference and assembled transcriptomes for RNA-Seq analysis
https://github.com/Oshlack/necklace/wiki
GNU General Public License v3.0

conda related installation issue #16

Closed sid5427 closed 3 years ago

sid5427 commented 3 years ago

Hi,

I downloaded necklace and tried the installation steps. The installation finished, but it reported that it could not find or install hisat2, stringtie and samtools. I suspect this is because I was running inside a conda environment where I have various bioinformatics tools installed, including the three above.

Should I try a separate installation of necklace from the "base" environment, i.e. with no conda environment active? Or should I edit tools.groovy to point to the paths of the tools installed within conda?

Along with the above question - I have two more small queries -

  1. I already have a Trinity de novo assembled transcriptome in fasta format, and I can pass it to necklace using the -p de_novo_assembly_file= option. I assume I still have to format the config file as suggested, with no special changes needed?

  2. Is there any way I could use more than one related species' genome and annotation, or am I limited to one?

nadiadavidson commented 3 years ago

Hi,

I would try to install in a clean base environment. Necklace will also try to install conda itself, and who knows what sort of issues that might cause. If that fails, you could try editing the tools.groovy file manually to point to your installed packages, as you suggest.
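Before editing tools.groovy by hand, it can help to see where each tool actually resolves in the currently active environment. The snippet below is a minimal sketch, not part of Necklace: `locate_tools` is a hypothetical helper name, and it only reports paths; it does not write tools.groovy.

```python
import shutil


def locate_tools(names):
    """Map each tool name to its absolute path on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # The three tools the installer reported as missing in this thread.
    for name, path in locate_tools(["hisat2", "stringtie", "samtools"]).items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

Run it once with the conda environment active and once from the base shell; any path it prints is a candidate for the corresponding entry in tools.groovy.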

For your other questions:

  1. Yes that's right. Alternatively you can leave the "-p de_novo_assembly_file=" out and instead put "de_novo_assembly_file=" into your config file if that's easier.

  2. Sorry, I think the documentation is a bit old. Necklace now takes a fasta file of protein sequences from a related species instead of a genome and annotation. So to use multiple related species, just cat them all together, e.g. cat species1.fasta species2.fasta species3.fasta > related_species.fasta, and then use proteins_related_species="/related_species.fasta" in the config file.
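One thing to watch with a plain cat: if an input file does not end with a newline, the first header of the next file gets glued onto the last sequence line. Below is a minimal sketch of the same concatenation that guards against this. `concat_fasta` is a hypothetical helper, not something Necklace provides, and the file names are the placeholder names from the example above.

```python
from pathlib import Path


def concat_fasta(inputs, output):
    """Concatenate FASTA files into `output`, ensuring a newline between files.

    Returns the number of records (header lines starting with '>') written.
    """
    n_records = 0
    with open(output, "w") as out:
        for path in inputs:
            text = Path(path).read_text()
            # Add the missing trailing newline so headers never merge with sequence.
            if text and not text.endswith("\n"):
                text += "\n"
            n_records += sum(1 for line in text.splitlines() if line.startswith(">"))
            out.write(text)
    return n_records
```

For example, `concat_fasta(["species1.fasta", "species2.fasta", "species3.fasta"], "related_species.fasta")` writes the combined file and returns the total number of protein records, which is a quick sanity check against the per-species counts.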

Cheers, Nadia.

sid5427 commented 3 years ago

Hi Nadia,

Thank you for your previous help - the pipeline worked well and we have been able to generate a supertranscript assembly of our rna-seq samples with some potential new genes.

We are in the process of writing up our work, and I was wondering if you could explain a bit more about how exactly the related species protein fasta file is used, as opposed to the related genome described in the Necklace paper. I assume the protein sequences are aligned to a consensus sequence generated from a cluster of similar transcripts from the de novo and genome-guided assemblies?

Thanks for your help!

nadiadavidson commented 3 years ago

Hi, I'm so glad you've found Necklace useful in your data analysis!

The protein sequences from related species are used in a fairly similar way to what we described in the original paper. Rather than clustering first and then aligning, Necklace uses the alignment results to cluster. These are the steps in the clustering:

  1. Transcripts in the reference annotation and those found through genome-guided assembly are already clustered together through their genomic positions, and a "genome-based superTranscriptome" (which could be considered a type of consensus sequence) is built. To incorporate the de novo assembly (which may include novel sequence and genes), Necklace does the following:
  2. The de novo assembled transcripts are aligned against the protein sequences from the related species
  3. De novo assemblers have a habit of creating false chimeras, where they stick the sequences of two different genes together because they share some sequence. To address this, Necklace will "cut" the assembled sequences if it sees multiple independent sections within a sequence that align to different proteins. This is new in the current version.
  4. The de novo assembled transcripts are also aligned against the "genome-based superTranscript".
  5. The alignments of the de novo assembled transcripts to the related species and the "genome-based superTranscript" are used to assign each de novo assembled transcript a gene ID.
  6. Then for each gene ID, all sequences associated with it (genome-based superTranscript and/or de novo assembled transcripts) are used to build a superTranscript. We use a program called "Lace" to do this, which we developed prior to Necklace.
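To make steps 3, 5 and 6 a little more concrete, here is a toy sketch of the logic. It is not Necklace's actual code: the function names, the input shapes, the midpoint cut position, and the rule that a genome-based superTranscript hit takes priority over a protein hit are all assumptions made purely for illustration.

```python
from collections import defaultdict


def cut_chimera(length, protein_hits):
    """Split a transcript where non-overlapping sections hit different proteins.

    protein_hits: list of (protein_id, start, end) in transcript coordinates.
    Returns a list of (start, end, protein_id) segments.
    Assumption: the cut is placed midway between the two alignments.
    """
    hits = sorted(protein_hits, key=lambda h: h[1])
    if not hits:
        return [(0, length, None)]
    segments = []
    seg_start = 0
    cur_id, _, cur_end = hits[0]
    for pid, start, end in hits[1:]:
        if pid != cur_id and start >= cur_end:
            # Independent section aligning to a different protein: cut here.
            cut = (cur_end + start) // 2
            segments.append((seg_start, cut, cur_id))
            seg_start, cur_id, cur_end = cut, pid, end
        else:
            # Same protein, or overlapping hit: extend the current section.
            cur_end = max(cur_end, end)
    segments.append((seg_start, length, cur_id))
    return segments


def assign_gene_ids(supertranscript_hit, protein_hit):
    """Group de novo sequence IDs by gene ID.

    supertranscript_hit: dict of sequence ID -> genome-based superTranscript gene ID.
    protein_hit: dict of sequence ID -> related-species protein ID.
    Assumption: a superTranscript hit wins; otherwise the protein ID is used.
    """
    clusters = defaultdict(list)
    for seq_id in sorted(set(supertranscript_hit) | set(protein_hit)):
        gene_id = supertranscript_hit.get(seq_id) or protein_hit.get(seq_id)
        clusters[gene_id].append(seq_id)
    return dict(clusters)
```

In this sketch, each resulting cluster (plus the genome-based superTranscript for that gene ID, where one exists) is the set of sequences that would be handed to Lace to build the final superTranscript.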

I hope this sort of makes sense and of course there is quite a bit of detail I've left out that I'd be happy to expand on if you need.

Cheers, Nadia.