Open Hitansh-Shah opened 2 years ago
Also I had a doubt about the dockerfile. If I am not wrong the container made by the dockerfile contains the cloned git repo of sequencer as while building the image it performs a fresh git clone. So while developing we cannot use the docker because the local changes will not be reflected. I don't have much experience with docker so I may have deviated to the wrong direction. Please correct me if I am mistaken.
So while developing we cannot use the docker because the local changes will not be reflected.
This is true. I'm open to updating this to allow easier local development. A workaround would be pushing your changes to a development branch and changing the Dockerfile to clone that branch instead of master. You could also just use a local environment for development instead of a container.
You're right. I'd remove https://github.com/Open-Book-Genome-Project/sequencer/blob/master/Dockerfile#L3 and then add a volumes entry at https://github.com/Open-Book-Genome-Project/sequencer/blob/master/docker-compose.yml#L6:
volumes:
  - ./:/sequencer
You could also just use a local environment for development instead of a container.
Yup, a virtual env seems to do the work.
I'll submit a new PR for the docker fixes :)
This PR seems like it's in the right direction. There may be opportunities for us to tune it to increase accuracy. E.g. How might we prevent false positives -- books which may mention the words "table of contents" which don't actually have a table of contents.
How might we prevent false positives -- books which may mention the words "table of contents" which don't actually have a table of contents.
I gave it a little thought. The Toc page will always be placed before the main content. So we actually don't have to scan the whole book. If somehow we can manage to set a limit for the for loop to break, we should be good to go.
If somehow we can manage to set a limit for the for loop to break, we should be good to go.
Doesn't the module's super().__init__(match_limit=1) do this here? KeywordPageDetectorModule will break once the match limit is reached:
https://github.com/Open-Book-Genome-Project/sequencer/blob/f6f6f8657fbfcfc2c675c154765339dfd5d5336c/bgp/modules/terms.py#L321
@finnless That's totally correct. But a book that has no table of contents page and mentions "table of contents" / "contents" somewhere in its text will still be detected.
As far as I know, we can avoid this in two ways:
1) Once we find a candidate page, we can add further validation before appending it to self.matched_pages.
2) If a table of contents is present, it will always come before the main content. So even if match_limit is not reached, we can break the loop once we figure out that we have entered the main content section, since from there on there is no point in iterating further.
I may have missed or misinterpreted something, so please correct me if I am going in the wrong direction.
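To make option 2 concrete, here is a minimal sketch of a front-of-book scan. It is not the sequencer's code: find_keyword_pages, max_pages, and the plain list of page strings are all hypothetical stand-ins, and a fixed page cutoff is only one crude way of guessing where the main content starts.

```python
def find_keyword_pages(pages, keywords, match_limit=1, max_pages=30):
    """Return indexes of pages containing any keyword, scanning only the front of the book.

    `pages` is a list of page texts. `max_pages` is a hypothetical cutoff that
    stands in for "we have entered the main content".
    """
    matched_pages = []
    for page_number, text in enumerate(pages):
        if page_number >= max_pages:
            break  # past the front matter: stop even if match_limit was not reached
        lowered = text.lower()
        if any(keyword in lowered for keyword in keywords):
            matched_pages.append(page_number)
            if len(matched_pages) >= match_limit:
                break  # same idea as the existing match_limit behaviour
    return matched_pages


# A book whose only mention of the phrase is on the last page, deep in the main content.
late_mention = ["body text"] * 19 + ["as listed in the table of contents"]
print(find_keyword_pages(late_mention, ["table of contents"], max_pages=5))
```

With the cutoff in place the late mention is ignored. The trade-off is that a book with unusually long front matter could have its real ToC skipped, so the cutoff would need tuning on real books.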
This seems like the right line of thinking. What other data on the page may enable us to detect table of contents pages? Also what about the books that use the word contents instead of table of contents? Do you think table of contents is usually one of the first things on the page? Are there other terms like glossary which frequently show up?
Also, could we use the book page image? https://www.researchgate.net/publication/4232729_Detection_and_Segmentation_of_Table_of_Contents_and_Index_Pages_from_Document_Images
Could we build a simple classifier which bounds accuracy? https://arxiv.org/pdf/1306.4631
Also what about the books that use the word contents instead of table of contents?
I guess we can pass multiple keywords in the module. Like for the copyright page there are copyright, ©.
Do you think table of contents is usually one of the first things on the page? Are there other terms like glossary which frequently show up?
I guess it can really vary from book to book. We can't say for sure. Also, how do we define "first"? There may be books where the heading is written vertically, like in the example I shared on Slack.
@mekarpeles I found something interesting today. In GNOME the Document Viewer application automatically creates a sidebar table of contents with links to those pages. It is open source so we can look at the source code. I will see if I can find something useful from it. I am attaching a screenshot from the Document viewer application. I will also take a look at the resources you attached.
In GNOME the Document Viewer application automatically creates a sidebar table of contents with links to those pages.
My guess would be in this case the contents metadata is included in the PDF itself rather than being derived by Document Viewer.
My guess would be in this case the contents metadata is included in the PDF itself rather than being derived by Document Viewer.
I guess you are right, because I get the same sidebar in Chrome too. My bad 😅
So I read the articles @mekarpeles attached. Both of them mainly focus on the characteristics of a ToC. One takes a more statistical approach to identifying the ToC, which is a bit complex, and the other takes a relatively simple approach. The main idea I got is that it may not be very accurate to just iterate through pages and look for the keywords passed in the module. Rather, we may have to scan the whole page for a pattern (e.g. a structure consisting of titles in bold font, occasionally starting with numbers which may be section numbers like 3.18) and then classify the page as either ToC or non-ToC.
I don't know whether we should implement ML or whether there are other ways that work without ML. As of now I hardly have any knowledge of ML, but if we do use it here I don't think it will be very advanced, so I could learn the concepts while implementing them, or at least I will try.
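As one possible non-ML starting point, here is a rough sketch that scores a page by how many of its lines look like ToC entries. It is not taken from either paper: ENTRY_PATTERN, toc_line_ratio, looks_like_toc, and the 0.5 threshold are all assumptions for illustration, and the sketch only uses plain text (no font or layout information).

```python
import re

# Hypothetical pattern: a line that ends in a page number, optionally after
# dot leaders, e.g. "Introduction ..... 7" or "3.18 Results 42".
ENTRY_PATTERN = re.compile(r"^\S.*?[\s.]+\d{1,4}$")


def toc_line_ratio(page_text):
    """Fraction of non-empty lines on the page that look like ToC entries."""
    lines = [line.strip() for line in page_text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    hits = sum(1 for line in lines if ENTRY_PATTERN.match(line))
    return hits / len(lines)


def looks_like_toc(page_text, threshold=0.5):
    """Classify a page as ToC when enough of its lines look like entries."""
    return toc_line_ratio(page_text) >= threshold


toc_page = "Contents\n1 Introduction ..... 1\n2 Methods ..... 9\n3.18 Results 42\nIndex 101"
prose_page = "She read the table of contents twice.\nThen she closed the book\nand went outside."
print(looks_like_toc(toc_page), looks_like_toc(prose_page))
```

A ratio like this could also serve as the extra validation step before a page is appended to the matches, or later as one feature for a small classifier.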
@mekarpeles @finnless I have made some changes in the TocPageDetectionModule. We can avoid almost all the cases where "contents" might be detected somewhere else in the book by simply checking whether it is the only word on the whole line. On a ToC page it will be present as a header, and so it will be the only word on that line. Obviously there will still be the case where "contents" happens to be the only word on the last line of a paragraph. But in that case we can safely assume that there will be some kind of punctuation attached to "contents", and as a result comparing it with our keyword would give False. I have implemented this in such a way that we can also take care of "table of contents".
Please provide your feedback on this for any improvements or corrections that can be done. After that we can test this on some books.
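For readers following along, the whole-line check described above can be sketched roughly like this. This is an illustration only, not the code in the PR: has_heading_line and its keywords default are hypothetical names.

```python
def has_heading_line(page_text, keywords=("contents", "table of contents")):
    """Return True if some line consists of exactly one of the keywords.

    Case and extra whitespace are ignored. Punctuation is deliberately kept,
    so a paragraph ending in "contents." does not match.
    """
    wanted = {" ".join(keyword.lower().split()) for keyword in keywords}
    for line in page_text.splitlines():
        normalized = " ".join(line.lower().split())
        if normalized in wanted:
            return True
    return False


heading_page = "  Table  of Contents \nChapter 1 ..... 3"
paragraph_page = "he emptied the box and spilled its\ncontents.\nNext morning"
print(has_heading_line(heading_page), has_heading_line(paragraph_page))
```

The comparison is an exact match on the normalized line, which is what makes the trailing punctuation case fall out as False.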
@Hitansh-Shah I made a few changes, take a look and see what you think and if you have any suggestions. Otherwise, we can try running this on 100 public books and see how it works!
Here's a good set of books to test with https://archive.org/search.php?query=%22table%20of%20contents%22&sin=TXT
@mekarpeles the changes you have made seem perfect to me. I have some minor concerns which I have commented in the respective changes conversation. Other than that I think we are ready to test the first version. :rocket:
Hey @mekarpeles, can you help me with the search query for retrieving the items? In the link you shared before for the set of books to test on, there is a URL parameter called sin=TXT which basically searches "Text Contents". I don't know how to state this in the query, because without it the search runs against "metadata". Can you please help me with this?
@mekarpeles, in bgp/modules/terms.py I have added the class for TocPageDetectorModule. It is a simple class copied from CopyrightPageDetectorModule. I removed the extractor function and changed the keywords for table of contents. I have also added mysequencer.py, a temporary file in the root of the project, for defining a sequencer which only detects the table of contents page.