Provide a better confidence for header analysis

KnowledgeCaptureAndDiscovery / somef

SOftware Metadata Extraction Framework: A tool for automatically extracting relevant software information from readme files

MIT License

Provide a better confidence for header analysis #138

Open dgarijo opened 3 years ago

dgarijo commented 3 years ago

Right now the header analysis returns a confidence of 1 whenever a keyword is detected in the title of a header. Although this generally works well, there are exceptions. Header analysis should instead return an estimate of how well the header fits the category. For example, "browser issues (FAQ)" will be tagged with the "issue" category, and that may be wrong. Long headers may also not be very informative.

dgarijo commented 3 years ago

Two possible solutions: 1) Provide a confidence value based on the length of the header (longer headers get lower confidence). 2) If more than two categories are detected and they are far apart in meaning, lower the confidence.
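A minimal sketch of how the two ideas could be combined, not taken from the SOMEF code base. The `KEYWORDS` table, the `header_confidence` function, and the `category_penalty` parameter are all hypothetical names introduced for illustration. Confidence is approximated as the fraction of header tokens that are category keywords, so longer headers score lower. As a simplification, the score is reduced whenever more than one category matches, since measuring how far apart two categories are in meaning would need something extra (for example, embeddings).

```python
import re

# Hypothetical keyword table for illustration; not SOMEF's actual vocabulary.
KEYWORDS = {
    "issue": {"issue", "issues", "bug", "bugs"},
    "faq": {"faq", "questions"},
    "installation": {"install", "installation", "setup"},
}


def header_confidence(header, category_penalty=0.5):
    """Return a {category: confidence} dict for a header title.

    Confidence is the share of header tokens that are keywords of the
    category, so a short header such as "Issues" scores higher than a
    long header that merely mentions an issue.
    """
    tokens = re.findall(r"[a-z0-9]+", header.lower())
    if not tokens:
        return {}

    scores = {}
    for category, words in KEYWORDS.items():
        hits = sum(1 for token in tokens if token in words)
        if hits:
            scores[category] = hits / len(tokens)

    # Assumption: any header matching several categories is ambiguous.
    # The semantic distance between categories is not measured here.
    if len(scores) > 1:
        scores = {c: s * category_penalty for c, s in scores.items()}
    return scores


if __name__ == "__main__":
    for title in ["Issues", "browser issues (FAQ)"]:
        print(title, "->", header_confidence(title))
```

With this scoring, "Issues" keeps a confidence of 1.0 for the "issue" category, while "browser issues (FAQ)" matches both "issue" and "faq" and ends up with a much lower score for each. The 0.5 penalty is an arbitrary placeholder and would need tuning on real README headers.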

Alternatively, we should explore using language models to capture the meaning of the header more accurately.