Trinotate / Trinotate.github.io

web documentation for Trinotate
Tweak to make more HPC friendly? #2

Closed vika19 closed 5 years ago

vika19 commented 5 years ago

Hello!

We have Trinotate installed as a module on our cluster here at Indiana University. It is installed in a central location, with the databases linked in a folder that an environment variable points to. This works great for the first part of the Trinotate workflow, as users can point to the databases like so:

blastx -query transcripts.fa -db $TRINOTATEDB/uniprot_sprot.pep -num_threads 16 -max_target_seqs 1 -outfmt 6 -evalue 1e-3 > blastx.outfmt6
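A minimal sketch of how a site wrapper might guard that call, assuming the `TRINOTATEDB` variable from the post above. The helper name `require_trinotate_db` is hypothetical and not part of Trinotate or BLAST; it only checks that the variable is set and that the SwissProt fasta is readable before anything is run.

```shell
#!/usr/bin/env bash
# Sketch only: fail early if the shared database location is not usable.
# TRINOTATEDB is the site-specific variable from the post above;
# require_trinotate_db is a hypothetical helper name.
require_trinotate_db() {
    local db="${TRINOTATEDB:-}"
    if [ -z "$db" ]; then
        echo "TRINOTATEDB is not set" >&2
        return 1
    fi
    if [ ! -r "$db/uniprot_sprot.pep" ]; then
        echo "cannot read $db/uniprot_sprot.pep" >&2
        return 2
    fi
    return 0
}

# Usage (the same blastx call, guarded by the check):
# require_trinotate_db && blastx -query transcripts.fa \
#     -db "$TRINOTATEDB/uniprot_sprot.pep" -num_threads 16 \
#     -max_target_seqs 1 -outfmt 6 -evalue 1e-3 > blastx.outfmt6
```

The distinct return codes let a module or job script tell "variable not set" apart from "file missing or unreadable".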

However, this does not work for the final report generation - as I understand it, the only way to get the xls fully populated is to have the databases in the working directory. I base this on the report missing information in the final several columns unless I symlink the databases into the working directory - at which point everything works as expected.

We don't really want users creating their own copies of the databases, as that many redundant copies strain our storage. We can automatically symlink the files into the directory, but that has its own pitfalls as well. Is there a place where we could point the scripts to the central install directory rather than having them look in the working directory only? I looked around a bit, but figured it would be faster to ask - Perl isn't exactly my native coding language.

In future, an option to point the report scripts to where the databases live would be very welcome ^_^. It would make Trinotate much more HPC friendly!

Thanks for the awesome software and I look forward to hearing from you! S

brianjohnhaas commented 5 years ago

Hi Sheri,

There shouldn't be any need for the databases to be in the current directory, as far as I know. Are there specific commands that are failing when you give the full path to any target input? If so, that should be easy enough to change on our end. I need to put out a new release by the end of the month anyway, so just let me know what's needed.

best,

~brian


--

Brian J. Haas The Broad Institute http://broadinstitute.org/~bhaas http://broad.mit.edu/~bhaas

vika19 commented 5 years ago

It never fails: as soon as you point out an error, it fixes itself ^_^. Someone else on the team rebuilt the databases (following suggestions in other forums), and that seems to be what fixed the issue - not the symlinks. I just didn't know that at the time! I'm continuing to test to make sure this is sane, but I am now getting output without the databases symlinked.

This solution makes a lot more sense to me, as I thought the info would be in the database at time of loading, not at dump to the report.

Would updating the database builds also explain why I'm getting different GO term output for the same transcript the first and second time I ran the load scripts?

brianjohnhaas commented 5 years ago

Gotcha. The GO terms are extracted from swissprot, and the fasta databases are generated by Trinotate during the initial boilerplate sqlite building process.

If you want to use a more current version of swissprot, then you'd need to rerun the Trinotate boilerplate sqlite builder. The sqlite db is always tied to a specific sprot db, and so you shouldn't mix/match them. Also, you need to search the sprot fasta file that Trinotate generates for you, as the identifiers are specially formatted.
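One way a central install could guard against mixing versions is sketched below. This is not part of Trinotate: `stamp_sprot` and `sprot_matches_stamp` are hypothetical helper names, and the sketch assumes GNU coreutils `sha256sum` is available. The idea is to record a checksum of the generated sprot fasta when the boilerplate sqlite is built, keep that stamp alongside the sqlite file, and compare it before running searches.

```shell
#!/usr/bin/env bash
# Sketch only: hypothetical helpers, not Trinotate commands.

# Record the checksum of the sprot fasta at build time.
# $1 = sprot fasta file, $2 = stamp file to write
stamp_sprot() {
    sha256sum < "$1" | cut -d' ' -f1 > "$2"
}

# Succeed only if the fasta still matches the recorded checksum.
# $1 = sprot fasta file, $2 = stamp file written by stamp_sprot
sprot_matches_stamp() {
    [ "$(sha256sum < "$1" | cut -d' ' -f1)" = "$(cat "$2")" ]
}

# Usage (paths are illustrative):
# stamp_sprot "$TRINOTATEDB/uniprot_sprot.pep" "$TRINOTATEDB/sprot.sha256"
# sprot_matches_stamp "$TRINOTATEDB/uniprot_sprot.pep" "$TRINOTATEDB/sprot.sha256" \
#     || echo "sprot fasta differs from the one stamped at build time" >&2
```

The stamp covers only the fasta, not the sqlite file, on the assumption that the sqlite file changes as results are loaded into it.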

hope this helps,

~b


vika19 commented 5 years ago

Thanks. I was aware of this aspect, which is part of why we wanted to keep it centralized - we can update just one location and keep things current. Thanks for the help ^_^.

brianjohnhaas commented 5 years ago

Sure thing.
