NCBI-Hackathons / seqr

Creative Commons Zero v1.0 Universal
Make Fasta index/parsing part of solr plugin? #46

Open averagehat opened 8 years ago

averagehat commented 8 years ago

This would be nice because then you could simply do curl <url> --data-binary @file.fasta, and it could also be used in other projects, allowing them to index FASTA files without using our app. Does this seem like a good idea? I'm not sure what the standard practice is in the Solr world.

lianyi commented 8 years ago

Right, we could turn this into a Solr tokenizer plugin, or perhaps an update processor if we want to populate multiple fields from one FASTA record.

averagehat commented 8 years ago

An update processor expects an input document, so it would work for setting the fields if the FASTA file is already wrapped in JSON or some other format Solr understands (otherwise Solr will error out before the content becomes an input document). One good thing about this approach is that it can be written in a scripting language and dropped into the config folder without having to include a jar. But the client still has to process the FASTA files.

To avoid that, it looks like we can subclass UpdateRequestHandler and override the createDefaultLoaders method to register our format along with a new ContentStreamLoader for FASTA. We would then declare the handler in solrconfig.xml as a requestHandler, and Solr should detect the format and process it correctly. There is some discussion here: http://andreagazzarini.blogspot.com/2014/12/loading-rdf-ie-custom-data-in-solr.html

This can be included in a jar file.
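
For comparison, the client-side route mentioned above (wrapping each FASTA record in JSON before posting) could look roughly like the sketch below. This is only an illustration, not code from seqr: the field names `id`, `description`, and `sequence` and the helper names are hypothetical, and the real schema may differ. It treats the first whitespace-delimited token of the header as the record id.

```python
import json
from typing import Dict, Iterable, Iterator, List


def _make_doc(header: str, chunks: List[str]) -> Dict[str, str]:
    # Hypothetical field names; adjust to whatever the Solr schema defines.
    parts = header.split(None, 1)
    return {
        "id": parts[0] if parts else "",
        "description": parts[1].strip() if len(parts) > 1 else "",
        "sequence": "".join(chunks),
    }


def parse_fasta(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per FASTA record found in an iterable of text lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield _make_doc(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        yield _make_doc(header, chunks)


def fasta_to_solr_json(lines: Iterable[str]) -> str:
    """Serialize all records as a JSON array of documents."""
    return json.dumps(list(parse_fasta(lines)))
```

The resulting JSON array could then be posted with something like curl <url> -H 'Content-Type: application/json' --data-binary @docs.json. A custom loader on the Solr side would need the same parsing logic, just moved into the plugin.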

lianyi commented 8 years ago

Yes, a customized UpdateRequestHandler can be thrown in as a Solr plugin, which can be used in both the server and command-line versions. Perhaps we can add it in version 2? For the first milestone, I'm thinking we could get the basic indexing/searching features out the door, likely after merging #37. Any chance you could look into the build failure? I'm not sure whether it's the same issue as the one fixed in master: the robot test uses the target jar during the testing phase, but the jar is only built during packaging, after testing has completed successfully.