ducoquelicot / observ_admin

All admin and back-end for Observ

learn about ElasticSearch #5

Closed ducoquelicot closed 5 years ago

ducoquelicot commented 5 years ago

First step: figure out how to get the PDFs into the Elasticsearch database. [Question: do we need to put the PDFs in the SQLite database that I have created as well?]

Solution: use the Ingest Attachment plugin in combination with Elasticsearch. See https://beenje.github.io/blog/posts/parsing-and-indexing-pdf-in-python/ for an explanation of how to use Ingest Attachment (which is built on Apache Tika). I can write a script that does this.
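A minimal sketch of what such a script might look like, assuming the official `elasticsearch` Python client and a node with the ingest-attachment plugin installed. The index name `pdfs`, the pipeline id `attachment`, and the helper names are placeholders, not anything from this repo. The attachment processor expects the file as a base64 string in the field it is pointed at (`data` here).

```python
import base64

# Properties the attachment processor can extract (the list from this issue).
ATTACHMENT_PROPERTIES = [
    "content", "title", "author", "keywords",
    "date", "content_type", "content_length", "language",
]


def build_pipeline(properties=("content", "title", "date")):
    """Build the body of an ingest pipeline that parses the `data` field."""
    unknown = set(properties) - set(ATTACHMENT_PROPERTIES)
    if unknown:
        raise ValueError(f"unknown attachment properties: {sorted(unknown)}")
    return {
        "description": "Extract text and metadata from PDFs",
        "processors": [
            {"attachment": {"field": "data", "properties": list(properties)}}
        ],
    }


def build_document(pdf_bytes):
    """Wrap raw PDF bytes as the base64 string the processor expects."""
    return {"data": base64.b64encode(pdf_bytes).decode("ascii")}


def index_pdf(es, path, index="pdfs", pipeline_id="attachment"):
    """Send one PDF through the pipeline.

    `es` is assumed to be an elasticsearch-py client; the exact keyword
    arguments of `index()` differ between client versions.
    """
    with open(path, "rb") as f:
        doc = build_document(f.read())
    return es.index(index=index, body=doc, pipeline=pipeline_id)
```

The pipeline itself would be registered once, for example with `es.ingest.put_pipeline(id="attachment", body=build_pipeline())`, before any documents are indexed.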

Question: which properties do we want to include in the indexing? Choose from content, title, author, keywords, date, content_type, content_length, language.

Other solution: use the script I created last quarter to convert the PDFs to text, and write a script in the Flask app that imports the existing PDFs into the SQLite database that we're also using for the users. Pluses: a) I already have the script, which means there's less figuring out to do, and b) the documents will be accessible through SQLAlchemy, which means that we can use db.Model and all that jazz that I've learned to use. I think it would also make it easier to show the content of the database on the website, and to connect a query to the user who made it, which would probably make the 'subscription' process easier as well.

Another pro is that it would save me the step of having to create an ElasticSearch cluster with ingest nodes, because I find all that shit v confusing.

Possible con: how are we going to add metadata?

Possible solution to con: create a FlaskForm or something similar to get the documents in the SQLite database with different elements - such as the city, doctype, etc - and then use that to index the document. Thinking of a form with

index='palo_alto' << city name
doctype='agenda' << document type
body='body' << document content in .txt format
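One way to picture those three fields before any FlaskForm exists is a plain record with a little validation. This is only a sketch: the class name `DocumentRecord` and its methods are hypothetical. The lowercase check is there because Elasticsearch index names must be lowercase.

```python
from dataclasses import dataclass


@dataclass
class DocumentRecord:
    index: str    # city name, e.g. "palo_alto"
    doctype: str  # document type, e.g. "agenda"
    body: str     # document content as plain text

    def validate(self):
        """Return a list of error messages (empty when the record is usable)."""
        errors = []
        if not self.index or self.index != self.index.lower() or " " in self.index:
            errors.append("index must be lowercase with no spaces")
        if not self.doctype:
            errors.append("doctype is required")
        if not self.body.strip():
            errors.append("body is empty")
        return errors

    def to_search_doc(self):
        """Fields to send to Elasticsearch; the city is used as the index name."""
        return {"doctype": self.doctype, "body": self.body}
```

The same three fields could later become the fields of a form and the columns of a SQLAlchemy model, so the SQLite row and the Elasticsearch document stay in step.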

workflow

When all the documents are both in the SQLite and the ElasticSearch database, a search function will be needed.

I'm taking all this as a variation on the way the Flask tutorial on ElasticSearch works. I don't know if there's an easier solution, especially because I think having the documents in the SQLite database is going to prove useful for the reasons listed above.

ducoquelicot commented 5 years ago

Indexing data from binary uses a lot of resources, so elastic recommends using an ingest node dedicated to this pipeline.

From the Elastic docs: an ingest node is a node that has node.ingest set to true (the default). Ingest nodes are able to apply an ingest pipeline to a document in order to transform and enrich the document before indexing. With a heavy ingest load, it makes sense to use dedicated ingest nodes and to mark the master and data nodes as node.ingest: false.

ducoquelicot commented 5 years ago

Step two: create an Elasticsearch cluster with ingest nodes that will be used and run the script against this cluster.

Questions: how do I create a cluster? How do I create nodes? I tried to Google these answers but I find everything highly, highly confusing.

ducoquelicot commented 5 years ago

Step three: once all the PDFs I currently have are indexed, perform several keyword searches to check how it works.

Note: I cannot use Kibana because my VM will literally freeze to death every time I open Elasticsearch with X-Pack involved, and I need to have that in order to use Kibana. Unfortunately.

ducoquelicot commented 5 years ago

Step four: update the search page so that the input from the search form is passed to a function that executes the search.
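A rough sketch of the piece that turns the search form's input into an Elasticsearch request body. The function name `build_query` and the field names `body` and `doctype` are assumptions carried over from the form idea in the first comment. The `term` filter assumes `doctype` is stored as an exact-match value.

```python
def build_query(text, doctype=None, size=10):
    """Translate search-form input into an Elasticsearch query body."""
    text = text.strip()
    if not text:
        raise ValueError("empty search")
    # Full-text match on the document content.
    query = {"bool": {"must": [{"match": {"body": text}}]}}
    if doctype:
        # Optional exact filter on the document type.
        query["bool"]["filter"] = [{"term": {"doctype": doctype}}]
    return {"query": query, "size": size}
```

The Flask view would then hand the result to the client, along the lines of `es.search(index="palo_alto", body=build_query(form_text))`, where `form_text` stands for whatever the form submitted.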

ducoquelicot commented 5 years ago

Step five: run a test on the search form to check whether it works.

ducoquelicot commented 5 years ago

Step six: write a pytest unit test for the search function to reconfirm that it works.
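Such a test does not need a running cluster if the search function takes the client as an argument. The sketch below is one possible shape, not the app's actual code: `run_search` and `FakeES` are hypothetical names, and the fake client only mimics the part of the response that the function reads.

```python
def run_search(es, index, text):
    """Run a match query and return the stored documents."""
    response = es.search(index=index, body={"query": {"match": {"body": text}}})
    return [hit["_source"] for hit in response["hits"]["hits"]]


class FakeES:
    """Stand-in client that records each call and returns a canned response."""

    def __init__(self, hits):
        self.hits = hits
        self.calls = []

    def search(self, index, body):
        self.calls.append((index, body))
        return {"hits": {"hits": [{"_source": h} for h in self.hits]}}


def test_run_search_returns_sources():
    fake = FakeES([{"doctype": "agenda", "body": "budget hearing"}])
    results = run_search(fake, "palo_alto", "budget")
    # The function unwraps _source from each hit...
    assert results == [{"doctype": "agenda", "body": "budget hearing"}]
    # ...and sends the expected query to the expected index.
    assert fake.calls == [("palo_alto", {"query": {"match": {"body": "budget"}}})]
```

Saved in a file whose name starts with `test_`, this would be picked up by running `pytest` in the project directory.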

ducoquelicot commented 5 years ago

Step seven: figure out a way for people to subscribe to a specific search.

Ideas:

Would also need to figure out a way to connect the search with the user's email and create a complete package of search query + frequency + email, which would then be used to execute the search at specific intervals and send the update to the user [this is going to be a separate issue]
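To make the "package" idea concrete, here is a small sketch of what one subscription record and the "is it time to run this again?" check could look like. Everything here is hypothetical (the `Subscription` class, its fields, `due_subscriptions`); in the app this would more likely be a SQLAlchemy model tied to the user table.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Subscription:
    email: str                          # where to send the update
    query: str                          # the saved search text
    frequency: timedelta                # how often to re-run the search
    last_run: Optional[datetime] = None  # None means it has never run

    def is_due(self, now):
        """True if the search has never run or the interval has elapsed."""
        return self.last_run is None or now - self.last_run >= self.frequency


def due_subscriptions(subscriptions, now):
    """Pick the subscriptions whose search should be executed now."""
    return [s for s in subscriptions if s.is_due(now)]
```

A scheduled job could call `due_subscriptions(...)` at a fixed interval, run each saved query, email the results, and then set `last_run`.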

ducoquelicot commented 5 years ago

Optional step eight: figure out a way to show the search results for their active subscriptions on the user's profile page.

zstumgoren commented 5 years ago

Hey, nice documentation of your thought/learning process ;)

So we can whiteboard/discuss at scrum, but I think your "alternative solution" touches on the sanest path:

As a next step, we should try to hunt down a Python library that makes it easy to perform CRUD operations on ElasticSearch documents. These operations will be integrated into various parts of your application (e.g. data loaders will need to create/update, while the Flask app will need "read" or search capabilities).

Also, let's plan to stick with the most basic Elastic setup possible for development purposes (a single-node cluster) on your local machine. When the prototype is complete, if you choose to put it in production, we can use one of the managed services from AWS or GCP. These services cost money, however, so you may want to ultimately leave this as an open-source prototype that others can build on.

zstumgoren commented 5 years ago

Oh, I would follow these instructions for a local install: https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-install.html#_installation_example_on_linux

You should also install Kibana.

ducoquelicot commented 5 years ago

I installed Kibana but as I mentioned, it crashes on me. My VM can hardly handle ElasticSearch as is.

zstumgoren commented 5 years ago

Aha. Sorry, missed that. Might be worth bumping up the available RAM on the VM if you can spare it. If memory serves, I think you can do that in one of the VM states (e.g. when it's powered down), though I've only used VBox. Not sure if that's what you're using, so YMMV...

ducoquelicot commented 5 years ago

I checked the requirements for Kibana/Elasticsearch and they said you should ideally have 64GB of RAM, or at least 16. My PC has 8, and I can't afford to let the VM take over my whole RAM...

ducoquelicot commented 5 years ago

Yeah, I think I mentioned the docs need to be in both databases in my documentation - but I admit it's a lot to read through. Anyway, let's discuss today!

zstumgoren commented 5 years ago

You shouldn't need 64GB of RAM to run a single-node cluster locally. Those higher recommendations are meant for production use. You should be able to run a small local instance in a 4GB VM where you devote half the RAM (2GB) to Elastic, per these docs:

https://www.elastic.co/guide/en/elasticsearch/reference/7.0/heap-size.html

That would mean, of course, bumping up the VM to 4GB if it's not already there. If that's not an option, we'll figure something else out. Let's talk this afternoon.

zstumgoren commented 5 years ago

One other note -- you've probably seen this already, but just in case, the Flask Mega-Tutorial chapter on search shows in great detail how to integrate Elastic with Flask ORM models. We can use this as a baseline for integration, but before committing to his custom approach, we should do some searching for possible Python libraries that make it easy to integrate ES with SQLAlchemy models. Similar libraries exist in Ruby, and I wouldn't be surprised if they exist in Python as well.
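For reference, the core of that approach, as I understand it, is that each model lists its searchable fields in a `__searchable__` attribute and a small helper copies those fields into the Elasticsearch document, using the database id as the document id. Below is a stripped-down sketch with a plain class standing in for the SQLAlchemy model. The `Document` class and its fields are made up for illustration, and the `index()` keyword arguments depend on the client version.

```python
class Document:
    """Plain stand-in for a SQLAlchemy model (hypothetical fields)."""

    # Only these attributes are copied into the search index.
    __searchable__ = ["doctype", "body"]

    def __init__(self, id, city, doctype, body):
        self.id = id
        self.city = city
        self.doctype = doctype
        self.body = body


def search_payload(model):
    """Collect the fields the model has marked as searchable."""
    return {field: getattr(model, field) for field in model.__searchable__}


def add_to_index(es, index, model):
    """Index the model under its database id so the two stores stay linked."""
    es.index(index=index, id=model.id, body=search_payload(model))
```

Because the Elasticsearch document id equals the SQLite primary key, search hits can be mapped straight back to database rows.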

ducoquelicot commented 5 years ago

Yeah I was studying that intently yesterday, my workflow basically comes from adapting what he does - although admittedly I didn't grasp everything he did so would be great to discuss.

ducoquelicot commented 5 years ago

It worked! I changed the available RAM for my VM to 4 gig and reinstalled Kibana and Elasticsearch. All is well now - I can be such a black hat sometimes ;p
