WilliamPham1602 / Document-Splitting

UvA Final Thesis

Research Question help! #5

Open WilliamPham1602 opened 2 years ago

WilliamPham1602 commented 2 years ago

Hi @lestervanderpluijm,

I just finished the introduction and research question. For this project, I was quite confused about the research question. Could you have a quick look at it?

The report link: https://www.overleaf.com/read/kypgyxfspjdv

Regards, Sang.

lestervanderpluijm commented 2 years ago

Hi @WilliamPham1602

Thanks for reaching out. Your text is missing the point that the “Dutch documents” are policy documents that have been released in the context of the Freedom of Information Act. This is highly relevant, because the splitting task is only needed because documents are unnecessarily released as one document for politically motivated reasons.

The reference to the Freedom of Information Act (WOB / since 1 May WOO) is also missing from your main research question. It would be good to make explicit the “NLP techniques” you will be focusing on. NLP is too broad. Why do you describe your task as subtracting sub-documents? Is your task not to un-concatenate the “original” policy documents? Your main research question works best when you first mention the class of techniques you see as most promising and then also mention the baseline. What is your baseline? It is not clear to me yet.

As to the sub-questions, you will need to reformulate. Take the second sub-question: it now sounds like a theoretical question. But you can make it concrete by turning it into a comparison of two or three machine learning models. Please specify. The same goes for the third question: it is too broad to ask which NLP technique is most suited for this corpus. What is the difference in your case between NLP technique and machine learning model? The first sub-question is concrete. But you can just write: to what extent can Doc2Vec convert images [more effectively] than TF-IDF? Will this be the only comparison you do? Or are there other conversion options you compare? The key is to make clear in the question how you measure effectiveness. What metric?

In a general sense, it is good to be aware that Maarten wants three sub-questions for the main question and then 3 sub-sub-questions per sub-question. I hope my input helps you to approach that format.

WilliamPham1602 commented 2 years ago

Hi @lestervanderpluijm,

I appreciate your feedback. I will update the document this week and let you know.

Regards, Sang.

WilliamPham1602 commented 2 years ago

Hi @lestervanderpluijm,

I have updated the introduction. Could you have a quick look at it and the research question?

Report link

If any part is insufficient, please let me know.

Thanks a lot for your support. Regards, Sang.

lestervanderpluijm commented 2 years ago

Hi @WilliamPham1602

Thanks for your draft. Your abstract is concrete, but it misses the connection to previous studies. It is centred on your research but it does not contextualize it. What is the history of splitting documents? Why is your method a promising path to take? You want to bring in the scientific relevance. See this link for the abstract guidelines: https://canvas.uva.nl/courses/6056/files/5852546?module_item_id=1147364

As to the introduction, you start with a non-informative sentence: “Documents have been playing (…)”. This can be skipped. I would rather start with positioning that the focus has traditionally been more on other topics than splitting in NLP. Then you can say something about the circumstances in which splitting is necessary. Documents are released as 1 PDF to conform to the Freedom of Information Act in the Netherlands, but this is done in a non-constructive way. You do not say anything about the context of your documents (WOB). When you add this, you get a better idea why it is challenging to do the splitting and why it is a relevant task to do.

Another thing that is missing in your introduction is a summary of what has been done before. You immediately start with stating that you introduce a simple but powerful method. Before you do this, you will need to state what has been done before (or what would be possible routes to take based on the literature). That will also back your scientific relevance, since you can then justify your method in the light of previous research. When I read the current text, you just state that you use an NLP technique. Please be more specific than that. Name the technique. It is your link to the literature.

In your research question, I like the concreteness. What is missing is the comparison between NLP techniques. Please label one as baseline and another as the focus of your research. You then get: to what extent does technique A perform better than technique B? Another thing is that WOB (Freedom of Information Act) is not mentioned. It is quite different from the “Dutch language”. As to your sub-questions, the second one is a literature question. Please reformulate to a set of supervised models that you compare on performance with a set of unsupervised models. The same goes for the third sub-question. Be aware that you do not conform to Maarten’s preferred tree structure with sub-sub-questions. I would not mind, but Maarten is our supervisor and examiner in the end.

In the ideal setup, your introduction is a page long (or a bit more). See this thesis example as a reference point: https://canvas.uva.nl/courses/6056/files/1722547?wrap=1

WilliamPham1602 commented 2 years ago

Hi @lestervanderpluijm,

I will revise the report following your feedback now. Thanks for the example, because I was not sure about Maarten's preferred sub-question style.

Thanks a lot for your help!

Regards, Sang.

WilliamPham1602 commented 2 years ago

Hi @lestervanderpluijm,

I have a question regarding the WOO. As far as I can see, the Freedom of Information Act is a US law, while the Netherlands has the Open Government Act (WOO). Should I refer to the Freedom of Information Act or the Open Government Act?

Source: https://business.gov.nl/regulation/freedom-of-information/

Regards, Sang.

lestervanderpluijm commented 2 years ago

Hi @WilliamPham1602 Please refer to the Open Government Act. This law came into effect in May. Before May, the WOB was in place. Some of your documents might still fall under the old law. The difference is that via WOB citizens can ask for information, while with WOO the government actively shares information. The problem of having to split might be less with WOO, but we will have to see whether that is true.

WilliamPham1602 commented 2 years ago

Hi @lestervanderpluijm,

Thanks again for your support. I have finished the Abstract and Introduction, following your feedback. Could you check them?

The report link.

Can we schedule a quick call to discuss the research question (Maarten’s preferred tree structure)? I remember that there is one main research question with three sub-questions, and three sub-sub-questions for each of the sub-questions.

Regards, Sang.

lestervanderpluijm commented 2 years ago

Hi @WilliamPham1602

All the content that is needed is now in your introduction. Well done. The structure can still be improved though, especially in the paragraph "On the other hand, the other way...". It would be strong to mention immediately at the start of the third paragraph why "it is still a challenging topic". When you mention "lots of new techniques", I would also like to see it written out a bit more precisely. You can skip the "many" in "many NLP tasks". A word like many adds unnecessary subjectivity to your text. The same goes for "much research". I feel that you can gain by making it more concrete.

This will also enable you to include 1-2 sentences in the abstract on the novelty of what you are doing in comparison to previous research. Is combining Computer Vision and Deep Learning Techniques into 1 pipeline novel? To say that, you first have to state what would be "normal". Another thing is that your abstract actually starts at "This paper presents...". The sentences that come before are explanations that fit into the introduction (you can move them there) but they are not needed in the abstract. You want to keep the abstract short and swift. It is nice that you mention the 80%. Can you say something on how this compares to the baseline or what would be expected based on previous literature?

As to the research questions, it would be fine to have a call. Could you make it on Thursday? Let's say 15:00 on Zoom?

WilliamPham1602 commented 2 years ago

Hi @lestervanderpluijm ,

I appreciate your feedback. Of course, the introduction and abstract will be fixed following your feedback now. I will add Thursday at 15:00 to my calendar.

Thanks a lot, Sang.

WilliamPham1602 commented 2 years ago

Hi @lestervanderpluijm ,

Thanks for your support! I have finished the draft Methodology and am still improving it. Could you have a quick look at the Abstract, Introduction, Literature Review and Methodology?

This is the report link: link

Regards, Sang.

lestervanderpluijm commented 2 years ago

Hi @WilliamPham1602

It is good to see your work progress. Your title can still be strengthened. You could think of something like: “Combining computer vision and deep learning to split concatenated policy documents”. You can name the context of the WOB in the subtitle.

Abstract. I would be careful with presenting “the solution”. You can rather frame it as that this paper aims “to improve the current solutions to un-concatenate latent documents in a large file”. My suggestion would also be to skip the term NLP and rather use deep learning. Be careful that you do not use the future tense: “this pipeline will convert”. The research is already done, so it is either the past or present tense. Why not include something extra on “the current solutions”? What do you aim to improve? A sentence like “The trained and tested data …” is too concrete for an abstract. Here you would want to focus on the fact that you use two corpuses based on the Dutch Freedom of Information Act. It is good that you mention the results, but to what baseline (Maarten’s score) do you compare? That would also be something to include here. Under the abstract, you should still note some keywords like: NLP, Deep Learning, Splitting, etc.

Introduction. Be sure to make explicit already in the first paragraph that you refer to the Dutch Freedom of Information Act. You do not make this explicit until the third paragraph. If you mention that “it is not an easy task”, you could reformulate to: “the difficulty lies in the lack of explicit labelling of the separate (latent) documents”. As I mentioned before, you should do your best to avoid the future tense. Instead of “This research will show”, you can say “This research aims”. Please skip “lots of new techniques”. You can write: “New techniques have been made public over the last 5 years.” You want to avoid the use of “a lot”, “most” and “few”. Your description of what has been done before has seriously improved. Still, I find it hard to discern what is the best that is now available. VGG16 for computer vision? And LSTM for NLP? I would start your paragraph with announcing what is best and then give the details you provide now.

Research questions. The main one, I would rewrite to: “To what extent can computer vision and deep learning techniques un-concatenate documents released under the Dutch Freedom of Information Act (WOB) more accurately than the [current standards]?” I am not sure what those standards are, but this could then be LSTM? Most sub-questions are strong. Third sub-question should be grammatically written as: “Do layout techniques …” Fourth sub-question: consider replacing “suitable” with “best performing”. Sixth sub-question: what do you mean by “cost comparision”? Computation time? You can be a bit more explicit here. The first sub-question is unclear to me. The Gold Label has not been mentioned before in the introduction. What makes an “useful label”? These are some comments to improve the questions, but they already give you a good structure.

The related work section is quite solid. The first two paragraphs read more like background though. They can be shortened. At the start, I would like to hear a short sentence on what is state of the art at this moment on un-concatenating documents. You just need two sentences on that, but it gives structure to the reader. Based on your structure, it is state of the art to use LSTM/NLP. What I do miss in 2.1 is the concrete results. What kind of scores did they have? Also, I can only find the twentieth reference in there. Where is the rest? You expect several references in a related work section. The title of sub-section 2.2 is too long. Three titles in one title. Make that more concise. Personally, I am a bit lost at 2.2. How does this section relate to 2.1? It would be helpful if you gave some lines on what we have learned from 2.1 and where we are going to with 2.2. Again, there is only 1 reference in 2.2. You made a good effort with the related work. But it needs to be extended and get more structured. It is not complete now. I miss the part in which you also make explicit what the research gap is. Where is research thin? That is the starting point for your own research.

Before you go to experimental setting, it is important to make a short section on methodology. How would you describe your methodology? With the settings you explain what you have done. With the methodology you explain why you have done it. Why do you choose a combination of computer vision and deep learning? There were other possibilities available. Methodology does not have to be long. Still, it should be clear what you “add to the literature” and what makes your approach “promising”. See also tips on the methodology in this document: https://canvas.uva.nl/courses/6056/files/5852546?module_item_id=1147364.

Your experimental settings look good. Grammatically, there are some hiccups. But the quality of what you present is good. Be sure to include references. If you mention that you start from a baseline using logistic regression, I am assuming that you base this on previous work. Or am I mistaken? In case of VGG16, you explain what you have done. It would be even better to say why you used certain settings. Can you give some form of justification? Did you copy the settings from another paper? How did you work them out? The title of 3.2 is too generic. I would rather make this into “Model implementation”. One part that is missing is a description of the actual data that you use. You described how you preprocessed it. But what do the corpuses look like? How were they selected? This should go before Data Preprocessing.

Final point. After the Results section, you first need a Discussion section. Only after the Discussion section follows the Conclusion. I hope you can use my feedback for improving your text. I am happy to see that your text is improving. You can send me a new version when you are ready.

WilliamPham1602 commented 2 years ago

Hi @lestervanderpluijm ,

Thanks a lot for your support. I am working on the results part. I will fix all the parts and send it to you asap.

Have a nice weekend and holiday, Sang

WilliamPham1602 commented 2 years ago

Hi @lestervanderpluijm,

I apologise for taking so long to fix the report. I was busy finalising the code and the models, and luckily I have now finished all the models. The only remaining work is the report. Could you check the current version of the report, which contains all the parts? After the content check, I will start fixing the grammar.

This is the report link: https://www.overleaf.com/5948359792rndwxymmpbjx

Regards, Sang.

lestervanderpluijm commented 2 years ago

Hi @WilliamPham1602

The thesis is still a bit short. You can extend on these points:

• Adding the actual results in the related work section, preferably as a range. Think of a sentence like: “Extensive research has been done into Named Entity Recognition [15]. Currently, neural networks [7, 3, 13] achieve top performance in NER with F1 scores around .81-.82 as compared to .77 in previous approaches.”
• The experimental setup can use an extension on the hyperparameter settings. There does not seem to be enough information to fully reproduce your results. There is no need to give the formulas for the metrics, since they are very common.
• You can add a confidence interval in the tables. In that way, you can compare the models in terms of significance. Otherwise, what is your measure for saying that one method works better than the other?
• Discussion is too short. It needs to be about a page. Common setup is: 1) interpretation of results: why are they as they are and how does it compare to the literature; 2) limitations and 3) future work. Make it into subsections. Be sure to quantify the outperformance in terms of percentages and significance. Do use references to other studies.
• Your conclusion is a bit too long. You can keep the introduction to a minimum. A recap on why it is relevant to do your research, (indirect) research question and what is the state of the art is enough. The focus should be on your results compared to the literature. Then, you want to state what the consequence is of your limitations for your results. Do they limit the validity of your research answer? You can finish with one or two sentences on how to proceed with your combination of computer vision and deep learning. Ideal length: half a page.
• Be sure to check the thesis on removing subjective language like “many” and “a lot”. You do not use it frequently, but it is best to eliminate completely. Instead of “many” you can use “multiple”, or you can just skip the word.
• Your extra figures need to be under the header of Appendix.

Please change “Data of defence” on your cover page with “Data of submission”. I like your title!
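
On the confidence-interval point, one common way to get an interval for a score like F1 is a percentile bootstrap over the test set. The snippet below is only a minimal sketch under assumptions that are not taken from the thesis: it assumes page-level binary labels (1 = the page starts a new document), the function names `f1_score` and `bootstrap_ci` are hypothetical, and the labels are toy data. Resampling individual pages also ignores that pages within one file are not independent, so resampling whole files may be more appropriate.

```python
import random


def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1); returns 0.0 if undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1 (hypothetical helper).

    Resamples (label, prediction) pairs with replacement and returns the
    alpha/2 and 1 - alpha/2 percentiles of the resampled F1 scores.
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy labels, not thesis data: 1 marks the first page of a document.
gold = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0] * 5
pred = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0] * 5
low, high = bootstrap_ci(gold, pred)
print(f"F1 = {f1_score(gold, pred):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

If the intervals of two models in a table do not overlap, that is a rough indication that the difference is not due to the particular test sample. A paired test on the same resamples would be a stricter comparison.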

WilliamPham1602 commented 2 years ago

Hi @lestervanderpluijm,

Thanks a lot for your feedback. I will fix it after Friday (resit exam).

Have a nice weekend! Sang