Open-Book-Genome-Project / sequencer

A toolchain of tasks for sequencing and fingerprinting book fulltext
https://bookgenomeproject.org

Table of Contents Page detection Draft #82

Open Hitansh-Shah opened 2 years ago

Hitansh-Shah commented 2 years ago

@mekarpeles, in bgp/modules/terms.py I have added the class TocPageDetectorModule. It is a simple class copied from CopyrightPageDetectorModule: I removed the extractor function and changed the keywords to match a table of contents. I have also added mysequencer.py, a temporary file in the root of the project that defines a sequencer which only detects the table of contents page.

Hitansh-Shah commented 2 years ago

Also, I had a question about the Dockerfile. If I am not wrong, the container built from the Dockerfile contains a cloned copy of the sequencer repo, because building the image performs a fresh git clone. So while developing we cannot use the docker because the local changes will not be reflected. I don't have much experience with Docker, so I may have gone in the wrong direction. Please correct me if I am mistaken.

finnless commented 2 years ago

So while developing we cannot use the docker because the local changes will not be reflected.

This is true. I'm open to updating this to allow easier local development. A workaround would be pushing your changes to a development branch and changing the Dockerfile to clone that branch instead of master. You could also just use a local environment for development instead of a container.

mekarpeles commented 2 years ago

You're right. I'd remove https://github.com/Open-Book-Genome-Project/sequencer/blob/master/Dockerfile#L3 and then add the following under volumes at https://github.com/Open-Book-Genome-Project/sequencer/blob/master/docker-compose.yml#L6:

volumes:
    - ./:/sequencer

Hitansh-Shah commented 2 years ago

You could also just use a local environment for development instead of a container.

Yup, a virtual env seems to do the work.

mekarpeles commented 2 years ago

I'll submit a new PR for the docker fixes :) This PR seems like it's in the right direction. There may be opportunities for us to tune it to increase accuracy. e.g. How might we prevent false positives -- books which may mention the words "table of contents" which don't actually have a table of contents.

Hitansh-Shah commented 2 years ago

How might we prevent false positives -- books which may mention the words "table of contents" which don't actually have a table of contents.

I gave it a little thought. The Toc page will always be placed before the main content. So we actually don't have to scan the whole book. If somehow we can manage to set a limit for the for loop to break, we should be good to go.

finnless commented 2 years ago

If somehow we can manage to set a limit for the for loop to break, we should be good to go.

Doesn't the module's super().__init__(match_limit=1) do this here?

KeywordPageDetectorModule will break once match limit is reached: https://github.com/Open-Book-Genome-Project/sequencer/blob/f6f6f8657fbfcfc2c675c154765339dfd5d5336c/bgp/modules/terms.py#L321
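
For readers without the repository open, the behaviour described here can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual KeywordPageDetectorModule from bgp/modules/terms.py: the class name KeywordPageDetectorSketch, its run method, and the representation of a book as a list of page strings are all assumptions made for the example.

```python
class KeywordPageDetectorSketch:
    """Hypothetical stand-in for a keyword page detector with a match limit."""

    def __init__(self, keywords, match_limit=None):
        self.keywords = [keyword.lower() for keyword in keywords]
        self.match_limit = match_limit
        self.matched_pages = []

    def run(self, pages):
        """Scan pages in order and record the index of each page containing a keyword."""
        for index, text in enumerate(pages):
            lowered = text.lower()
            if any(keyword in lowered for keyword in self.keywords):
                self.matched_pages.append(index)
            # Stop scanning as soon as enough matches have been collected.
            if self.match_limit and len(self.matched_pages) >= self.match_limit:
                break
        return self.matched_pages


pages = [
    "Title page",
    "Table of Contents\nChapter 1 ... 5",
    "As listed in the table of contents, chapter 1 begins here.",
]
detector = KeywordPageDetectorSketch(["table of contents"], match_limit=1)
print(detector.run(pages))
```

With match_limit=1 the loop stops at the first page that contains the keyword, so later mentions are never reached. It does not help when the first mention is itself a false positive.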

Hitansh-Shah commented 2 years ago

@finnless That's totally correct. But with this, a book that has no table of contents page and only mentions "table of contents" / "contents" somewhere in its text will also be detected.

As far as I know, we can avoid this in two ways. 1) If we find a candidate page, we can add further validation before appending it to self.matched_pages. 2) If a table of contents is present, it will always come before the main content. So even if match_limit has not been reached, we can break the loop once we figure out that we have entered the main content section, since from there on there is no point in iterating further.

I may have missed or misinterpreted something, so please correct me if I am going in the wrong direction.
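
The two ideas in this comment can be combined in a small sketch. Everything named here is hypothetical: find_toc_page, has_numbered_entries, and the max_percent and min_pages parameters do not come from the PR. The sketch also swaps "detect that the main content has started" for a simpler fixed cutoff on how far into the book to scan, which is an assumption rather than something the thread settled on.

```python
def find_toc_page(pages, keywords=("table of contents", "contents"),
                  max_percent=15, min_pages=10, validate=None):
    """Return the index of the first plausible ToC page, or None if none is found."""
    # Idea 2: only scan the front of the book (a fixed share of the pages,
    # but at least min_pages), instead of iterating over everything.
    cutoff = min(len(pages), max(min_pages, len(pages) * max_percent // 100))
    for index in range(cutoff):
        lowered = pages[index].lower()
        if not any(keyword in lowered for keyword in keywords):
            continue
        # Idea 1: run extra validation before accepting the candidate page.
        if validate is None or validate(pages[index]):
            return index
    return None


def has_numbered_entries(text):
    """Illustrative validator: at least two lines end in a digit (a likely page number)."""
    lines = [line.strip() for line in text.splitlines()]
    return sum(1 for line in lines if line and line[-1].isdigit()) >= 2


with_toc = ["Title", "Contents\nPreface 1\nIntroduction 7"] + ["body text"] * 98
without_toc = ["Title"] + ["body text"] * 98 + ["the contents of the box"]

print(find_toc_page(with_toc, validate=has_numbered_entries))
print(find_toc_page(without_toc, validate=has_numbered_entries))
```

The cutoff values are placeholders and would need tuning against real books.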

mekarpeles commented 2 years ago

This seems like the right line of thinking. What other data on the page may enable us to detect table of contents pages? Also what about the books that use the word contents instead of table of contents? Do you think table of contents is usually one of the first things on the page? Are there other terms like glossary which frequently show up?

mekarpeles commented 2 years ago

Also, could we use the book page image? https://www.researchgate.net/publication/4232729_Detection_and_Segmentation_of_Table_of_Contents_and_Index_Pages_from_Document_Images

Could we build a simple classifier which bounds accuracy? https://arxiv.org/pdf/1306.4631

Hitansh-Shah commented 2 years ago

Also what about the books that use the word contents instead of table of contents?

I guess we can pass multiple keywords in the module. Like for the copyright page there are copyright, ©.

Hitansh-Shah commented 2 years ago

Do you think table of contents is usually one of the first things on the page? Are there other terms like glossary which frequently show up?

I guess it can really vary from book to book. We can't say for sure. Also, how do we define "first"? There may be books where the heading is written vertically, as in the example I shared on Slack.

Hitansh-Shah commented 2 years ago

@mekarpeles I found something interesting today. In GNOME the Document Viewer application automatically creates a sidebar table of contents with links to those pages. It is open source, so we can look at the source code. I will see if I can find something useful in it. I am attaching a screenshot from the Document Viewer application. I will also take a look at the resources you attached.

finnless commented 2 years ago

In GNOME the Document Viewer application automatically creates a sidebar table of contents with links to those pages.

My guess would be in this case the contents metadata is included in the PDF itself rather than being derived by Document Viewer.

Hitansh-Shah commented 2 years ago

My guess would be in this case the contents metadata is included in the PDF itself rather than being derived by Document Viewer.

I guess you are right, because I get the same sidebar in Chrome too. My bad 😅

Hitansh-Shah commented 2 years ago

So I read the articles @mekarpeles attached. Both of them focus mainly on the characteristics of a ToC. One takes a more statistical approach to identifying the ToC, which is a bit complex; the other takes a relatively simple approach. The main idea I got is that it may not be very accurate to just iterate through pages and look for the keywords passed to the module. Rather, we may have to scan the whole page for a pattern (e.g. a structure consisting of titles in bold font, occasionally starting with numbers which may be section numbers like 3.18) and then classify it as either a ToC or a non-ToC page.

I don't know whether we should use ML or whether there are other ways that work without ML. As of now I have hardly any knowledge of ML, but if we do use ML here I don't think it will be very advanced, so I could learn the concepts while implementing them, or at least I will try.
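
As one possible non-ML starting point, the page-structure idea can be approximated with plain-text heuristics. The sketch below is an assumption-laden illustration, not the method of either linked paper: the function names, the two regular expressions, and the 0.5 threshold are invented for the example, and font information such as bold titles is ignored because it is not available in plain text.

```python
import re

# A line that ends in a page number, optionally after dot leaders: "Introduction ..... 7"
ENDS_WITH_PAGE_NUMBER = re.compile(r"\S.*?[\s.]+\d{1,4}$")
# A line that starts with a section number such as "3" or "3.18"
STARTS_WITH_SECTION_NUMBER = re.compile(r"^\d+(\.\d+)*\s+\S")


def toc_line_ratio(page_text):
    """Return the share of non-empty lines that look like ToC entries."""
    lines = [line.strip() for line in page_text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    hits = sum(
        1 for line in lines
        if ENDS_WITH_PAGE_NUMBER.search(line) or STARTS_WITH_SECTION_NUMBER.match(line)
    )
    return hits / len(lines)


def looks_like_toc(page_text, threshold=0.5):
    """Classify a page as ToC when enough of its lines look like entries."""
    return toc_line_ratio(page_text) >= threshold


toc_page = "CONTENTS\n1 Introduction ..... 1\n2 Methods ..... 15\n3.18 Results 42"
prose_page = "The contents of the box were\nscattered across the floor.\nNobody noticed."

print(toc_line_ratio(toc_page), looks_like_toc(toc_page))
print(toc_line_ratio(prose_page), looks_like_toc(prose_page))
```

A ratio like this could also serve as a single feature for a simple classifier later, if the heuristics alone turn out not to be accurate enough.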

Hitansh-Shah commented 2 years ago

@mekarpeles @finnless I have made some changes to the TocPageDetectorModule. We can avoid almost all the cases where "contents" might be detected elsewhere in the book by simply checking whether it is the only word on its line. On a ToC page it will appear as a header, and so it will be the only word on that line. Obviously there will still be the case where "contents" happens to be the only word on the last line of a paragraph. But in that case we can safely assume that some punctuation will be attached to "contents", so comparing it with our keyword will give False. I have implemented this in such a way that it also handles "table of contents".

Please provide your feedback on this for any improvements or corrections that can be done. After that we can test this on some books.
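
The whole-line check described in this comment could look roughly like the sketch below. It is not the code from the PR: has_toc_heading and TOC_HEADINGS are hypothetical names, and treating a page as a single newline-separated string is an assumption.

```python
TOC_HEADINGS = {"contents", "table of contents"}


def has_toc_heading(page_text, headings=TOC_HEADINGS):
    """Return True if some line consists of nothing but a ToC heading."""
    for line in page_text.splitlines():
        # Collapse runs of whitespace and ignore case, but keep punctuation,
        # so that a paragraph ending in "contents." does not match.
        normalized = " ".join(line.split()).lower()
        if normalized in headings:
            return True
    return False


print(has_toc_heading("  Table  of   Contents \nChapter 1 5"))
print(has_toc_heading("he emptied the box of its\ncontents."))
```

Because punctuation is left in place, "contents." at the end of a paragraph fails the comparison, which matches the assumption made in the comment.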

mekarpeles commented 2 years ago

@Hitansh-Shah I made a few changes, take a look and see what you think and if you have any suggestions. Otherwise, we can try running this on 100 public books and see how it works!

Here's a good set of books to test with https://archive.org/search.php?query=%22table%20of%20contents%22&sin=TXT

Hitansh-Shah commented 2 years ago

@mekarpeles the changes you have made seem perfect to me. I have some minor concerns, which I have left as comments on the respective changes. Other than that, I think we are ready to test the first version. :rocket:

Hitansh-Shah commented 2 years ago

Hey @mekarpeles can you help me with the 'search query' for retrieving the items? In the link you shared earlier for the set of books to test on, there is a URL parameter, sin=TXT, which basically searches "Text Contents". I don't know how to express this in a query, because without it the search runs against "metadata". Can you please help me with this?