learn about ElasticSearch

ducoquelicot commented 5 years ago

First step: figure out how to get the PDFs into Elasticsearch. [Question: do we need to put the PDFs in the SQLite database that I have created as well?]

Solution: use the Ingest Attachment plugin with Elasticsearch. See https://beenje.github.io/blog/posts/parsing-and-indexing-pdf-in-python/ for an explanation of how to use Ingest Attachment (which is built on Apache Tika). I can write a script that does this.

Question: which properties do we want to include in the indexing? Choose from content, title, author, keywords, date, content_type, content_length, language.
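To make the properties question concrete, here is a minimal sketch of the two request bodies involved, assuming the attachment processor's `field` and `properties` options. The helper names (`build_attachment_pipeline`, `build_pdf_document`) and the field name `data` are hypothetical, chosen for illustration only. Nothing here talks to a cluster; it only builds the dictionaries.

```python
import base64

# Properties the attachment processor can extract, as listed in the question above.
ATTACHMENT_PROPERTIES = [
    "content", "title", "author", "keywords",
    "date", "content_type", "content_length", "language",
]


def build_attachment_pipeline(properties, field="data"):
    """Return an ingest pipeline body that parses the base64 payload in `field`.

    `field` is a hypothetical name; any field name works as long as the
    indexed document uses the same one.
    """
    unknown = set(properties) - set(ATTACHMENT_PROPERTIES)
    if unknown:
        raise ValueError(f"unsupported properties: {sorted(unknown)}")
    return {
        "description": "Extract text and metadata from PDFs",
        "processors": [
            {"attachment": {"field": field, "properties": list(properties)}}
        ],
    }


def build_pdf_document(pdf_bytes, field="data"):
    """Wrap raw PDF bytes as the base64 string the attachment processor expects."""
    return {field: base64.b64encode(pdf_bytes).decode("ascii")}


pipeline = build_attachment_pipeline(["content", "title", "date"])
doc = build_pdf_document(b"%PDF-1.4 example bytes")
```

With the elasticsearch-py 7.x client, these bodies would presumably be sent with `es.ingest.put_pipeline(id="attachment", body=pipeline)` and `es.index(index="palo_alto", pipeline="attachment", body=doc)`; the pipeline id and index name are placeholders.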

Other solution: use the script I created last quarter to convert PDFs to text, and write a script in the Flask app that imports the existing PDFs into the SQLite database that we're also using for the users. The pluses are that a) I already have the script, which means there's less figuring out to do, and b) the documents will be accessible through SQLAlchemy, which means we can use db.Model and all that jazz that I've learned to use. I think it would also make it easier to show the content of the database on the website and to connect a user to the query they made, which would probably make the 'subscription' process easier as well.

Another pro is that it would save me the step of having to create an ElasticSearch cluster with ingest nodes, because I find all that shit v confusing.

Possible con: how are we going to add metadata?

Possible solution to con: create a FlaskForm or something similar to get the documents in the SQLite database with different elements - such as the city, doctype, etc - and then use that to index the document. Thinking of a form with

index='palo_alto' << city name
doctype='agenda' << document type
body='body' << document content in .txt format

workflow

I have a custom scraper for city x, doctype y. It should store documents where the other functions can reach them. The scraper runs x times per day/week/month. [this can be done last] [Is there a way to only scrape the newest documents? Probably. The script could check whether the docname already exists and skip the document if it does]
I have a model/form to add the documents to the SQLite database. Attributes would be id, index, doctype, body. Either have custom models for all scrapers, or find a way to make sure that the index and doctype values can be gotten from the document when it is parsed through automatically [for instance if documents are stored in a folder called 'palo_alto' with a name like 'agenda_ymd', it could get the index-name from the folder name and the doctype from the file name]
I have a function that adds documents to the SQLite database after they've been scraped. These scripts could be combined in a pipeline.py
I have a function that indexes documents as soon as they are added to the SQLite database. Could look at the Flask tutorial for that. Said function should get the index, id, doctype and body from the database
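A rough sketch of the middle of this workflow: deriving the index and doctype from the folder and file name, and skipping documents whose docname is already stored. The thread plans to use SQLAlchemy models; this sketch uses the standard-library `sqlite3` module only so that it runs on its own. The table layout and the function names are assumptions, not the project's actual code.

```python
import sqlite3
from pathlib import Path


def parse_doc_path(path):
    """Derive (index, doctype, docname) from a path like 'palo_alto/agenda_20190417.txt'."""
    p = Path(path)
    index = p.parent.name              # folder name -> city / index name
    doctype = p.stem.split("_", 1)[0]  # text before the first underscore -> doctype
    return index, doctype, p.name


def init_db(conn):
    """Create the (hypothetical) documents table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               id INTEGER PRIMARY KEY,
               "index" TEXT NOT NULL,
               doctype TEXT NOT NULL,
               docname TEXT NOT NULL,
               body TEXT NOT NULL,
               UNIQUE ("index", docname)
           )"""
    )


def add_document(conn, path, body):
    """Insert a scraped document; return its id, or None if it is already stored."""
    index, doctype, docname = parse_doc_path(path)
    cur = conn.execute(
        'INSERT OR IGNORE INTO documents ("index", doctype, docname, body) '
        "VALUES (?, ?, ?, ?)",
        (index, doctype, docname, body),
    )
    conn.commit()
    # rowcount is 0 when the UNIQUE constraint caused the insert to be ignored.
    return cur.lastrowid if cur.rowcount == 1 else None
```

A pipeline.py could call `add_document` for each scraped file and pass only the newly inserted ids on to the indexing step.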

When all the documents are both in the SQLite and the ElasticSearch database, a search function will be needed.

I'm taking all this as a variation on the way the Flask tutorial on ElasticSearch works. I don't know if there's an easier solution, especially because I think having the documents in the SQLite database is going to prove useful for the reasons listed above.

ducoquelicot commented 5 years ago

Extracting and indexing text from binary files uses a lot of resources, so Elastic recommends using an ingest node dedicated to this pipeline.

A node that has node.ingest set to true (default). Ingest nodes are able to apply an ingest pipeline to a document in order to transform and enrich the document before indexing. With a heavy ingest load, it makes sense to use dedicated ingest nodes and to mark the master and data nodes as node.ingest: false.

ducoquelicot commented 5 years ago

Step two: create an Elasticsearch cluster with ingest nodes that will be used and run the script against this cluster.

Questions: how do I create a cluster? How do I create nodes? I tried to Google these answers but I find everything highly, highly confusing.

ducoquelicot commented 5 years ago

Step three: once all the pdfs I currently have are indexed, perform several keyword searches to check how it works.

Note: I cannot use Kibana because my VM will literally freeze to death every time I open ElasticSearch with the X-packages involved, and I need to have those in order to use Kibana. Unfortunately.

ducoquelicot commented 5 years ago

Step four: update the search page so that the output of the search page will be parsed into a function that executes the search.

Understand the search query mechanism used by elasticsearch-dsl
Write a function that executes a search
Rewrite the search page into a FlaskForm form
Connect the output from the search form into the search function
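As a starting point for the "function that executes a search" item, here is a sketch that builds the raw query body a search form could produce. It assumes the documents have a full-text `body` field and a `doctype` field mapped so that exact-match `term` filters work; the function name and parameters are hypothetical.

```python
def build_search_body(query, doctype=None, page=1, per_page=10):
    """Build an Elasticsearch query-DSL body for a full-text search on `body`.

    `doctype`, when given, is applied as an exact-match filter.
    `page` is 1-based and is translated into the `from` offset.
    """
    if not query or not query.strip():
        raise ValueError("query must not be empty")
    must = [{"match": {"body": query.strip()}}]
    filters = [{"term": {"doctype": doctype}}] if doctype else []
    return {
        "query": {"bool": {"must": must, "filter": filters}},
        "from": (page - 1) * per_page,
        "size": per_page,
    }


body = build_search_body("housing", doctype="agenda", page=2)
```

The same match query can be expressed in elasticsearch-dsl as `Search(index="palo_alto").query("match", body="housing")`. Building the body as a plain dict keeps the function easy to unit test without a running cluster.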

ducoquelicot commented 5 years ago

Step five: run a test on the search form to check whether it works.

ducoquelicot commented 5 years ago

Step six: write a pytest unit test for the search function to reconfirm that it works.

ducoquelicot commented 5 years ago

Step seven: figure out a way for people to subscribe to a specific search.

Ideas:

Let people perform a search first, then provide the option to subscribe to it
Let people subscribe to a search from their profile page [would need similar form as 'search' but added in a question about the frequency of updates]
Or both

We would also need to figure out a way to connect the search with the user's email and create a complete package of search query + frequency + email, which would then be used to execute the search at specific intervals and send the update to the user [this is going to be a separate issue]
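One possible shape for that "complete package", sketched under assumptions: the `Subscription` class, the `FREQUENCIES` table, and the frequency names are all hypothetical, and "monthly" is approximated as 30 days. The sketch only decides which subscriptions are due; running the search and sending the email are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical frequency options; "monthly" is approximated as 30 days.
FREQUENCIES = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
}


@dataclass
class Subscription:
    email: str
    query: str
    frequency: str
    last_sent: Optional[datetime] = None

    def is_due(self, now):
        """A subscription is due if it was never sent or its interval has elapsed."""
        if self.last_sent is None:
            return True
        return now - self.last_sent >= FREQUENCIES[self.frequency]


def due_subscriptions(subscriptions, now):
    """Return the subscriptions whose search should be re-run at time `now`."""
    return [s for s in subscriptions if s.is_due(now)]
```

A scheduled job could call `due_subscriptions` with the current time, run each query, email the results, and then update `last_sent`.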

ducoquelicot commented 5 years ago

Optional step eight: figure out a way to show the search results for their active subscriptions on the user's profile page.

zstumgoren commented 5 years ago

Hey, Nice documentation of your thought/learning process ;)

So we can whiteboard/discuss at scrum, but I think your "alternative solution" touches on the sanest path:

Extract text from PDF (this may require pure Python/pdftotext type tool or OCR if the document is scanned)
Insert the extracted text into elasticsearch along with metadata about the documents. I agree that you should lean on the SQLAlchemy ORM. The database layer should store things such as document metadata (e.g. name of doc, city, and possibly even the extracted text of the document). But to get the benefits of real search, you'll need to also "index" some or all of this metadata in ElasticSearch (along with the text of the document, of course). This can feel a little duplicative, but there are good reasons for including at least some of the metadata in both places (we can discuss those during scrum).

As a next step, we should try to hunt down a Python library that makes it easy to perform CRUD operations on ElasticSearch documents. These operations will be integrated into various parts of your application (e.g. data loaders will need to create/update, while the Flask app will need "read" or search capabilities).
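Until such a library turns up, the CRUD operations could be kept behind a thin wrapper like the sketch below. It assumes `client` behaves like the elasticsearch-py 7.x `Elasticsearch` object (keyword arguments `index`, `id`, `body`); the `DocumentIndex` class and its method names are hypothetical. Passing the client in makes the wrapper usable from both the data loaders and the Flask app, and testable with a stand-in client.

```python
class DocumentIndex:
    """Thin CRUD wrapper around one Elasticsearch index.

    `client` is assumed to expose index(), get() and delete() with the
    keyword signature of the elasticsearch-py 7.x client.
    """

    def __init__(self, client, index):
        self.client = client
        self.index = index

    def save(self, doc_id, metadata, body):
        """Create or overwrite a document: metadata fields plus the extracted text."""
        payload = dict(metadata, body=body)
        return self.client.index(index=self.index, id=doc_id, body=payload)

    def read(self, doc_id):
        """Return the stored fields of one document."""
        return self.client.get(index=self.index, id=doc_id)["_source"]

    def remove(self, doc_id):
        """Delete one document from the index."""
        return self.client.delete(index=self.index, id=doc_id)
```

Using the SQLite row id as `doc_id` would keep the two stores in step, which is one reason to hold the metadata in both places.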

Also, let's plan to stick with the most basic Elastic setup possible for development purposes (a single-node cluster) on your local machine. When the prototype is complete, if you choose to put it in production, we can use one of the managed services from AWS or GCP. These services cost money, however, so you may want to ultimately leave this as an open-source prototype that others can build on.

zstumgoren commented 5 years ago

Oh, I would follow these instructions for a local install: https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-install.html#_installation_example_on_linux

You should also install Kibana.

ducoquelicot commented 5 years ago

I installed Kibana but as I mentioned, it crashes on me. My VM can hardly handle ElasticSearch as is.

zstumgoren commented 5 years ago

Aha. Sorry, missed that. Might be worth bumping up the available RAM on the VM if you can spare it. If memory serves, I think you can do that in one of the VM states (e.g. when it's powered down), though I've only used VBox. Not sure if that's what you're using, so YMMV...

ducoquelicot commented 5 years ago

I checked the requirements for Kibana/Elasticsearch and it mentioned you should ideally have 64GB of RAM, or at least 16. My pc has 8 - and I can't afford to let the VM take over my whole RAM....

ducoquelicot commented 5 years ago

Yeah, I think I mentioned the docs need to be in both databases in my documentation - but I admit it's a lot to read through. Anyway, let's discuss today!

zstumgoren commented 5 years ago

You shouldn't need 64GB of RAM to run a single-node cluster locally. Those higher recommendations are meant for production use; you should be able to run a small local instance in a 4GB VM where you devote half the RAM (2GB) to Elastic, per these docs:

https://www.elastic.co/guide/en/elasticsearch/reference/7.0/heap-size.html

That would mean, of course, bumping up the VM to 4GB if it's not already there. If that's not an option, we'll figure something else out. Let's talk this afternoon.

zstumgoren commented 5 years ago

One other note -- you've probably seen this already, but just in case, the Flask Mega Tutorial chapter on search shows in great detail how to integrate Elastic with Flask ORM models. We can use this as a baseline for integration, but before committing to his custom approach, we should do some searching for possible Python libraries that make it easy to integrate ES with SQLAlchemy models. Similar libraries exist in Ruby, and I wouldn't be surprised if they exist in Python as well.

ducoquelicot commented 5 years ago

Yeah I was studying that intently yesterday, my workflow basically comes from adapting what he does - although admittedly I didn't grasp everything he did so would be great to discuss.

ducoquelicot commented 5 years ago

It worked! I changed the available RAM for my VM to 4GB and reinstalled Kibana and Elasticsearch. All is well now - I can be such a black hat sometimes ;p

ducoquelicot / observ_admin

learn about ElasticSearch #5