PATRIC3 / patric3_website

Legacy PATRIC Website (JBoss Portal Version)
MIT License

FASTA header prefix demon #1245

Closed aswarren closed 8 years ago

aswarren commented 8 years ago

This has been a problem with RNA-Seq, and now with the variation service, and really with any program that needs to match the sequence IDs in our FASTA files against the sequence IDs in our GFF files. It means we can have difficulty uploading our own files to our own site. We modeled our FASTA headers after NCBI: >accn|CP001362. As a result, our FASTA headers don't match the sequence IDs in our GFF files or in our internal reference scheme (JBrowse and elsewhere): CP001362. RefSeq recently fixed this, e.g. ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Acetobacter_cerevisiae/latest_assembly_versions/GCF_001580535.1_ASM158053v1/GCF_001580535.1_ASM158053v1_genomic.fna.gz. Really it just requires that we drop the accn| prefix so that we can quit making special accommodations. We should fix it in the API and in the FTP file generation simultaneously.
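To make the mismatch concrete, here is a minimal Python sketch (not PATRIC code) that collects the IDs from a FASTA header and from GFF column 1 and reports the GFF IDs that have no FASTA counterpart. The helper names `fasta_ids` and `gff_seqids` are hypothetical, and the sample records are made up around the CP001362 accession mentioned above.

```python
def fasta_ids(fasta_text):
    """Return the first whitespace-delimited token of every FASTA header."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip a bare ">" with no ID
                ids.add(fields[0])
    return ids


def gff_seqids(gff_text):
    """Return the column-1 sequence IDs of all GFF feature lines."""
    ids = set()
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        ids.add(line.split("\t")[0])
    return ids


# Made-up sample data for illustration only.
fasta_text = ">accn|CP001362 example description\nACGTACGT\n"
gff_text = "##gff-version 3\nCP001362\tPATRIC\tgene\t1\t8\t.\t+\t.\tID=gene1\n"

# GFF sequence IDs that a downstream tool would fail to find in the FASTA.
unmatched = gff_seqids(gff_text) - fasta_ids(fasta_text)
print(unmatched)  # {'CP001362'}
```

With the prefix in place the FASTA ID is `accn|CP001362`, so the GFF ID `CP001362` is reported as unmatched.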

rkenyon commented 8 years ago

Maulik, Bob, Harry, or whoever, what is required to make this change in the database? What other side effects would the change have with the current system?

Ron

Ron Kenyon PATRIC Project Manager, patricbrc.org Project Director, Biocomplexity Institute Virginia Tech rkenyon@vbi.vt.edu

mshukla1 commented 8 years ago

I believe the immediate change needed is in the API function that generates the FASTA file used as input by the services.

If we want to minimize the changes, we can just customize that function.

We can make similar changes in all the other places as well if we want to, i.e. the download files on the FTP site and the other sequence downloads from the website.

aswarren commented 8 years ago

We should take pains to maintain parity between the FTP files and the files generated by the API. We don't want someone getting different results because they used a file downloaded from the FTP site instead of one generated by the API.
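One way to check that kind of parity is to compare the sequence IDs and a per-sequence checksum between the two copies. The sketch below is an assumption about how such a check could look, not an existing PATRIC tool; `parse_fasta`, `fingerprint`, and `parity_report` are hypothetical names, and the two inputs are made-up samples.

```python
import hashlib


def parse_fasta(text):
    """Yield (sequence_id, sequence) pairs from FASTA text."""
    seq_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            chunks = []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def fingerprint(text):
    """Map each sequence ID to a SHA-256 digest of its uppercased sequence."""
    return {
        seq_id: hashlib.sha256(seq.upper().encode("ascii")).hexdigest()
        for seq_id, seq in parse_fasta(text)
    }


def parity_report(ftp_text, api_text):
    """List IDs present in only one copy, and shared IDs whose sequences differ."""
    ftp, api = fingerprint(ftp_text), fingerprint(api_text)
    shared = ftp.keys() & api.keys()
    return {
        "only_in_ftp": sorted(ftp.keys() - api.keys()),
        "only_in_api": sorted(api.keys() - ftp.keys()),
        "sequence_differs": sorted(k for k in shared if ftp[k] != api[k]),
    }


# Made-up samples: same sequence, different header style and line wrapping.
ftp_copy = ">CP001362\nACGT\nACGT\n"
api_copy = ">accn|CP001362\nACGTACGT\n"
print(parity_report(ftp_copy, api_copy))
```

Because the digest is taken over the joined, uppercased sequence, differences in line wrapping or letter case are ignored and only real ID or sequence differences are reported.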

mshukla1 commented 8 years ago

Agreed. I need to recreate most of the download files anyway after we update the functions. I will try to have the updated files on the FTP site as part of the November data release.

-Maulik

mshukla1 commented 8 years ago

Updated all current .fna files to change the ID format from >accn|xxxxx to >xxxxx.

Fixed the script that generates the download files.
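For anyone who still has older downloads with the prefixed headers, the same conversion can be approximated locally. This is a minimal sketch and not the script referred to above; `strip_accn_prefix` is a hypothetical name, and the second record (CP001363) is made up to show that headers without the prefix pass through unchanged.

```python
import io

PREFIX = "accn|"


def strip_accn_prefix(src, dst, prefix=PREFIX):
    """Copy FASTA lines from src to dst, dropping the prefix from header IDs.

    Returns the number of header lines that were changed.
    """
    changed = 0
    for line in src:
        if line.startswith(">" + prefix):
            line = ">" + line[1 + len(prefix):]
            changed += 1
        dst.write(line)  # sequence lines and other headers are copied as-is
    return changed


# In-memory example; real use would pass open file objects instead.
src = io.StringIO(">accn|CP001362 example description\nACGTACGT\n>CP001363\nTTTT\n")
dst = io.StringIO()
count = strip_accn_prefix(src, dst)
print(count)
print(dst.getvalue(), end="")
```

The function works line by line, so it does not need to hold a whole genome in memory.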