Evaluate extractor

boun-tabi-LMG / turkish-academic-text-harvest

MIT License

Closed furkanakkurt1335 closed 12 months ago

furkanakkurt1335 commented 1 year ago

After all the steps are done, we need to finalize the extractor script by evaluating it on several outputs before running it on all the PDFs.

furkanakkurt1335 commented 1 year ago

Two points I have right now about the script output:

Table of contents is not removed.
Sometimes, Turkish characters (e.g. ş) are decoded incorrectly. For example, ş appears as Ģ in thesis 782470. We can gather all the wrong decodings into a dictionary and apply str.replace for each one that is found in a PDF.
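
The dictionary idea could look roughly like the sketch below. This is only an illustration, not the script in the repository: the names REPLACEMENTS and fix_decoding are hypothetical, and the only mapping taken from this thread is Ģ to ş (seen in thesis 782470). Other wrong decodings would be added as they are found.

```python
# Hypothetical mapping of wrong decodings to the intended Turkish characters.
# Only the "Ģ" -> "ş" pair comes from this thread (thesis 782470);
# further pairs would be added as they are discovered.
REPLACEMENTS = {
    "Ģ": "ş",
}


def fix_decoding(text, replacements=None):
    """Return text with every known wrong decoding replaced."""
    if replacements is None:
        replacements = REPLACEMENTS
    for wrong, right in replacements.items():
        # Only call str.replace when the wrong character is actually present.
        if wrong in text:
            text = text.replace(wrong, right)
    return text


print(fix_decoding("araĢtırma"))  # prints "araştırma"
```

One caveat with this approach: a blind replacement would also change any text where Ģ is intended, which seems unlikely for Turkish academic text but is worth checking during the evaluation.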

furkanakkurt1335 commented 1 year ago

Started a dictionary in e7a529b.

furkanakkurt1335 commented 12 months ago

@zeynepyirmibes handled the above-mentioned dictionary with replacement_dict in /normalize.py.

extractor.py was used for yok-tez and dergipark. In the end, we were happy with the outputs of the script.