datahubio / datahub-v2-pm

Project management (issues only)

[Epic] Full text search #89

Closed: zelima closed this issue 6 years ago

zelima commented 6 years ago

As a Consumer looking for data, I want to get a dataset on topic X so that I can use it for my work.

As a Consumer looking for data, I want to be able to search with relevant terms and see whether any related datasets are available.

What's the situation now:

Acceptance Criteria

Tasks

Analysis

Search system has several parts:

Problem: we don't index the readme at the moment, so we can't search it.

=> changing what is indexed requires changing the mapping (or moving the readme into datahub.description) => either option requires reloading the datapackages => that means editing the dump to S3 step in the assembler plus rerunning all flows [painful and complex] => this needs a deeper analysis of the issue => should we re-architect a bit?
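The two mapping options mentioned above can be sketched as follows. This is only an illustration: `CURRENT_MAPPING`, its fields, and the helper `with_readme` are hypothetical and are not taken from the actual metastore code. Note that adding a field to a mapping does not populate it for documents that are already indexed, which is why a reload of the datapackages is needed either way.

```python
import copy

# Hypothetical current mapping (assumption): the readme is not indexed.
CURRENT_MAPPING = {
    "properties": {
        "title": {"type": "text"},
        "datahub": {
            "properties": {
                "owner": {"type": "keyword"},
            }
        },
    }
}


def with_readme(mapping, nested_under_datahub=False):
    """Return a copy of `mapping` with a full-text field for the readme.

    nested_under_datahub=False adds a top-level `readme` field;
    nested_under_datahub=True adds `datahub.description` instead.
    """
    new_mapping = copy.deepcopy(mapping)
    props = new_mapping["properties"]
    if nested_under_datahub:
        props["datahub"]["properties"]["description"] = {"type": "text"}
    else:
        props["readme"] = {"type": "text"}
    return new_mapping


top_level = with_readme(CURRENT_MAPPING)
nested = with_readme(CURRENT_MAPPING, nested_under_datahub=True)
```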

Solutions:

Questions

How the load to metastore works today

2 parts

Where and when does ES index get set up?

Adding documents

graph LR

dp[Data Package] --> dumptos3(Dump to S3 Pipeline)
dumptos3 --> weirddp[DP with single resource<br/>with a single row that is the object for ES]
weirddp --> dumptoes(Write to ES Pipeline)
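To make the intermediate step in the graph concrete, here is a rough sketch of what a "DP with a single resource with a single row" could look like. The function names, the resource name `es-document`, and the chosen fields are assumptions for illustration only; the real shape is whatever the dump to S3 pipeline produces.

```python
def to_es_document(datapackage, readme_text=""):
    """Build the single object that would be written to ES (hypothetical fields)."""
    return {
        "name": datapackage.get("name"),
        "title": datapackage.get("title", ""),
        "readme": readme_text,
    }


def to_single_row_package(datapackage, readme_text=""):
    """Wrap the ES object as a Data Package with one resource holding one row."""
    return {
        "name": "{}-es".format(datapackage.get("name")),
        "resources": [
            {
                "name": "es-document",  # hypothetical resource name
                "data": [to_es_document(datapackage, readme_text)],
            }
        ],
    }


# Illustrative input, not a real dataset.
source = {"name": "demo-dataset", "title": "Demo dataset"}
wrapped = to_single_row_package(source, readme_text="A short readme.")
```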

What's not working about this

How to re-architect

graph TD

raw[Raw Data] --> factory[Factory]
factory --> finished[Finished Product: Data Package on BitStore]
finished -.done.-> foreman[Foreman]
foreman --> indexer[Indexing Service]
indexer --put--> metastore[MetaStore]
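The proposed flow can be sketched with in-memory stand-ins. Everything here is hypothetical (class names, method names, document fields); it shows the push variant, in which the Foreman notifies the indexing service once a Data Package is finished, and the indexer puts a searchable document, including the readme, into the MetaStore.

```python
class MetaStore:
    """In-memory stand-in for the ES-backed metastore."""

    def __init__(self):
        self.docs = {}

    def put(self, doc_id, doc):
        self.docs[doc_id] = doc

    def search(self, term):
        # Naive case-insensitive substring match over title and readme.
        term = term.lower()
        return sorted(
            doc_id
            for doc_id, doc in self.docs.items()
            if term in doc["title"].lower() or term in doc["readme"].lower()
        )


class Indexer:
    """Turns a finished package into a searchable document."""

    def __init__(self, metastore):
        self.metastore = metastore

    def index(self, package):
        doc_id = "{}/{}".format(package["owner"], package["name"])
        doc = {
            "title": package.get("title", ""),
            "readme": package.get("readme", ""),
        }
        self.metastore.put(doc_id, doc)


class Foreman:
    """Receives 'done' events from the factory and notifies subscribers."""

    def __init__(self):
        self._listeners = []

    def subscribe(self, callback):
        self._listeners.append(callback)

    def done(self, package):
        for callback in self._listeners:
            callback(package)


metastore = MetaStore()
foreman = Foreman()
foreman.subscribe(Indexer(metastore).index)

# Illustrative package, not a real dataset.
foreman.done({
    "owner": "example",
    "name": "demo-dataset",
    "title": "Demo dataset",
    "readme": "Monthly measurements of atmospheric carbon dioxide.",
})
```

With this wiring, `metastore.search("carbon")` finds the package through its readme even though the term is absent from the title. A listening variant would replace `subscribe` with the indexer polling for finished packages.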

Do we push notify the ES index system or does the index system listen?

Requirements:

Going forward

https://github.com/datahq/assembler/blob/master/datapackage_pipelines_assembler/processors/dump_to_s3.py#L17

Repos to work with:

Mikanebu commented 6 years ago

Could we also consider search on the dashboard page? In some cases a dataset cannot be found through the search box on the user's dashboard, but it can be found on the main search page. Is this related to this issue, or do I need to create a separate issue?

zelima commented 6 years ago

@Mikanebu no, this is not related. Please open a separate issue in the frontend repo.

zelima commented 6 years ago

FIXED.

Neither "finance" nor "climate" appears in the title or the README, so neither returns the expected results. That could probably be solved with keywords, but it is not related to this issue.