NCBI-Hackathons / seqr

Creative Commons Zero v1.0 Universal

Index should accept CSV/JSON #26

Closed averagehat closed 9 years ago

averagehat commented 9 years ago

I was thinking we could add a create (distinct from index) command to load JSON/CSV/FASTA files to initialize an existing database. But I think this could be included in the index command.

If the database does not exist, index would create a new one; otherwise it would add to the existing database (which seems to work out of the box). There is no reason either command couldn't accept any of these file types (distinguished by file extension). This is useful because of #23, and because it is hard to store metadata in FASTA format.
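A minimal sketch of the extension-based dispatch proposed above, not seqr's actual code. The function names, the set of recognized extensions, and the choice to map a FASTA header to an "id" field are all assumptions made for illustration.

```python
import csv
import io
import json
from pathlib import Path


def parse_fasta(text):
    """Yield one record per FASTA entry (header without '>', joined sequence)."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield {"id": header, "sequence": "".join(chunks)}
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield {"id": header, "sequence": "".join(chunks)}


def parse_json(text):
    """Accept either a single JSON object or a list of objects."""
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


def parse_csv(text):
    """Treat the first CSV row as field names; all values stay strings."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


# Hypothetical extension table; the real command may recognize other suffixes.
PARSERS = {
    ".fasta": parse_fasta,
    ".fa": parse_fasta,
    ".json": parse_json,
    ".csv": parse_csv,
}


def load_records(path):
    """Pick a parser from the file extension and return a list of records."""
    suffix = Path(path).suffix.lower()
    try:
        parser = PARSERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix!r}")
    return list(parser(Path(path).read_text()))
```

JSON and CSV records can carry arbitrary metadata fields, which is the advantage over FASTA mentioned above.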

Another factor is that the swissprot data already has a computed index field, while other JSON/FASTA files will not. Maybe we should just check for the "index" field and compute the index only if it is missing. In that case, "index" would be a reserved field that users shouldn't use.
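The reserved-field check could look roughly like this. It is a sketch under assumptions: `with_index` is a hypothetical helper, and `compute_index` stands in for whatever routine actually derives the index from a sequence (it is passed in rather than implemented here).

```python
RESERVED_FIELD = "index"  # field name taken from the proposal above


def with_index(record, compute_index):
    """Return a record that has an index, computing it only when absent.

    `compute_index` is a placeholder callable taking the sequence string;
    the real index computation is not shown in this sketch.
    """
    if RESERVED_FIELD in record:
        # Precomputed (e.g. the swissprot data): keep the value as-is.
        return record
    enriched = dict(record)  # copy so the caller's record is not mutated
    enriched[RESERVED_FIELD] = compute_index(record["sequence"])
    return enriched
```

Because the presence of the field decides whether the index is recomputed, user data that used "index" for something else would silently skip indexing, which is why the field would have to be reserved.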

DCGenomics commented 9 years ago

Hi Everyone,

I just want to make sure you guys are comfortable with me moving this repo over to the NCBI hackathon org this afternoon.

I'll wait one more hour, since I've seen a lot of recent commits.

Cheers!

Ben

What have you done today to make the world a better place?

lewisg-ncbi commented 9 years ago

Ben,

What's the downside of the move? Just that we have to check out from another repository?

Best, Lewis


averagehat commented 9 years ago

I'd like to use my GitHub account on Travis CI for this project. We don't have that set up now. I'm not sure what permissions I will need for that to work within an organization, but it works for the organizations I'm a part of (though I'm an admin in those).

DCGenomics commented 9 years ago

Exactly. The upside is that we have some expanded admin capabilities.

Cheers!

Ben

lewisg-ncbi commented 9 years ago

Sounds good to me; that is, to have an update and a create command.

Best, Lewis


DCGenomics commented 9 years ago

I've moved it over. Let me know if it has the functionality you guys want. If not, Matt or I will make the appropriate adjustments.

Cheers!

Ben


nyetsche commented 9 years ago

The permissions Travis CI requires (http://docs.travis-ci.com/user/github-oauth-scopes/) are tame, so I think everyone who's a current collaborator should be able to start using Travis CI. Admittedly, I've never used it.

That being said, I just made @lianyi, @lewisg-ncbi, and @averagehat 'admin' for the NCBI-Hackathons/seqr repo, so all of you have extra privileges. You can even add new collaborators!

averagehat commented 9 years ago

@nyetsche I think I need to be a member of the organization hosting the repository in order to add it in Travis CI.

DCGenomics commented 9 years ago

I already added it in Travis CI.


averagehat commented 9 years ago

My mistake; apparently Travis CI URLs are case-sensitive. The builds are live here: https://travis-ci.org/NCBI-Hackathons/seqr Thanks!

nyetsche commented 9 years ago

:+1:

averagehat commented 9 years ago

@lianyi When inserting documents, is it necessary to use the FindIndex class manually, or is Solr set up to automatically index them? I know the latter is true for searching, but is it true for indexing? Thanks.

lianyi commented 9 years ago

The most recent update doesn't require FindIndex for indexing. When the sequence is provided in the "sequence" field, e.g. {sequence:"AAAAAAA",id:6,...}, it is automatically tokenized as we used to do in FindIndex.
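As an illustration of what a client would send under this setup, here is a sketch that builds a request for Solr's standard JSON update handler. The base URL and the core name `seqr` are assumptions, as is the helper name `build_update_request`; the tokenization itself happens on the Solr side and is not shown.

```python
import json
import urllib.request


def build_update_request(docs, base_url="http://localhost:8983/solr/seqr"):
    """Build (but do not send) a POST to Solr's /update handler.

    `base_url` is a hypothetical local core; adjust it to the real deployment.
    """
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/update?commit=true",  # commit=true makes the docs searchable
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The client only supplies the raw sequence; no FindIndex call is needed.
request = build_update_request([{"id": 6, "sequence": "AAAAAAA"}])
# To send it to a running Solr instance: urllib.request.urlopen(request)
```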

lianyi commented 9 years ago

Also, we could probably add an option, e.g. -clean, that lets the user wipe all of the indexes before indexing new FASTA/JSON files.

Without this -clean option, it's basically an incremental update mode.
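One way the flag could be wired up, as a sketch only: the parser layout, the `plan_operations` helper, and the program name are hypothetical. The delete body uses Solr's delete-by-query form with the match-all query `*:*`.

```python
import argparse


def build_parser():
    """Hypothetical command line for the index command with a -clean flag."""
    parser = argparse.ArgumentParser(prog="seqr index")
    parser.add_argument("files", nargs="+", help="FASTA/JSON files to index")
    parser.add_argument(
        "-clean",
        action="store_true",
        help="wipe all existing indexes before indexing the given files",
    )
    return parser


def plan_operations(argv):
    """Return the ordered operations implied by the arguments.

    With -clean, a delete-everything step comes first; without it, the
    files are simply added on top of what is already indexed.
    """
    args = build_parser().parse_args(argv)
    operations = []
    if args.clean:
        operations.append(("delete", {"delete": {"query": "*:*"}}))
    operations.extend(("add", path) for path in args.files)
    return operations
```

Running without the flag yields only "add" steps, which matches the incremental update mode described above.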