Using the library to extract text from pdf while maintaining encoding

LBeaudoux / iso639

A fast, simple ISO 639 library.

MIT License


Using the library to extract text from pdf while maintaining encoding #15

Closed: billmdevs closed this issue 1 year ago

billmdevs commented 1 year ago

Hello @LBeaudoux ,

Thank you for the awesome library! I may have missed it in your description and documentation, but is it possible to use the library to extract text from a PDF written in a language that has an ISO 639-3 code, while preserving the text encoding?

LBeaudoux commented 1 year ago

I'm not sure I understand your question. The constructor of the Lang class accepts values of type str as arguments, and raises an error otherwise:

>>> from iso639 import Lang
>>> lang_name = "English"
>>> Lang(lang_name.encode("utf-8"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/lbdx/iso639/iso639/iso639.py", line 68, in __new__
    raise InvalidLanguageValue(*args, **kwargs)
iso639.exceptions.InvalidLanguageValue: (*(b'English',), **{}). Only valid ISO 639 language values are supported as arguments.

Is that what you wanted to know?

billmdevs commented 1 year ago

Thank you for your response @LBeaudoux!

To be clearer, here is the problem I am trying to solve:

I want to extract text from a PDF written in French (language code: fre or fra) and in Ghomala (language code: bbj), but I couldn't see how to do it with your library, as the docs don't mention anything along those lines. How do I do that with this iso639 library? As I said in my previous message, I may have missed something.

LBeaudoux commented 1 year ago

Maybe there is a misunderstanding. This library is not for extracting text from PDF files, but only helps you to handle the ISO 639 series of international standards for language codes.

billmdevs commented 1 year ago

I see! There was indeed a misunderstanding. When you say the library helps to "handle" the codes, I am not completely sure what that means. For the extraction itself there are many other libraries I can use, but how would I handle a language code with this one?

I don't think I fully understand the purpose of your library or how to use it.

LBeaudoux commented 1 year ago

Basically, this library maps together the different codes and names of the ISO 639 standard. For example, you can get the name of a language from its ISO 639-3 code:

>>> from iso639 import Lang
>>> lg = Lang("eng")
>>> lg
Lang(name='English', pt1='en', pt2b='eng', pt2t='eng', pt3='eng', pt5='')
>>> lg.name
'English'

You can find examples in the README.md file.
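
To make the "maps together the different codes and names" idea concrete for the two languages asked about above, here is a minimal, self-contained sketch. It is not the library's implementation: `LangRecord`, `lookup`, and the three-row table are hypothetical names and data introduced only for illustration, with field names borrowed from the `Lang` repr shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LangRecord:
    """One language with its ISO 639 identifiers ("" means no code assigned)."""
    name: str
    pt1: str   # ISO 639-1 (two letters)
    pt2b: str  # ISO 639-2 bibliographic code
    pt2t: str  # ISO 639-2 terminologic code
    pt3: str   # ISO 639-3


# Hypothetical three-row table; the real standard lists thousands of languages.
_RECORDS = [
    LangRecord("English", "en", "eng", "eng", "eng"),
    LangRecord("French", "fr", "fre", "fra", "fra"),
    LangRecord("Ghomálá'", "", "", "", "bbj"),
]

# Index every non-empty code and name so that any of them finds the same record.
_INDEX = {}
for _rec in _RECORDS:
    for _key in (_rec.name, _rec.pt1, _rec.pt2b, _rec.pt2t, _rec.pt3):
        if _key:
            _INDEX[_key] = _rec


def lookup(value: str) -> LangRecord:
    """Return the record matching a language name or any of its codes."""
    if not isinstance(value, str):
        # Mirrors the behaviour described earlier: only str values are accepted.
        raise TypeError(f"expected str, got {type(value).__name__}")
    try:
        return _INDEX[value]
    except KeyError:
        raise ValueError(f"unknown language value: {value!r}") from None


if __name__ == "__main__":
    print(lookup("fre"))       # same record as lookup("fra")
    print(lookup("bbj").name)  # Ghomala, which only has an ISO 639-3 code
```

In this sketch `lookup("fre")` and `lookup("fra")` resolve to the same French record, which is the kind of normalization a language-code library provides. Assuming the real library behaves as in the `Lang("eng")` example above, the corresponding calls would be `Lang("fra")` and `Lang("bbj")`; reading the PDF text itself remains the job of a separate extraction library.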

billmdevs commented 1 year ago

Thank you very much for your help!