BelgianBiodiversityPlatform / python-dwca-reader

🐍 A Python package to read Darwin Core Archive (DwC-A) files.
BSD 3-Clause "New" or "Revised" License

python-dwca-reader in Jython #43

Closed tucotuco closed 8 years ago

tucotuco commented 8 years ago

Currently the python-dwca-reader has lxml as a requirement. Is there a reason for this? I do not see where it is actually used. The reason I ask is that I would very much like to use the python-dwca-reader with Jython, but the dependency on lxml (which has no implementation that works with Jython, since it is based on C and has not been ported to date) makes this impossible. BeautifulSoup can use other parsers, so I wonder if it is possible to elect the parser rather than require lxml.

tucotuco commented 8 years ago

Oops, it looks like lxml is the only XML parser BeautifulSoup can use:

"Right now, the only supported XML parser is lxml. If you don’t have lxml installed, asking for an XML parser won’t give you one, and asking for “lxml” won’t work either."

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use

So the new question becomes, "Would it be possible to have the reader not depend on BeautifulSoup?"

niconoe commented 8 years ago

Hi John,

Indeed, you've nailed it: python-dwca-reader depends on BeautifulSoup, and BeautifulSoup needs lxml. I've been uncomfortable for a long time about having such a heavy dependency for relatively "peripheral" features.

So one of my medium-term plans was to replace BeautifulSoup with something lighter, or at least make it optional. Do you urgently need to use python-dwca-reader? In the next few days (let's say a week) I can find time to evaluate whether I can publish a new version that doesn't depend on BeautifulSoup. If it's not too hard and it's useful for you, I'd definitely go for it. It's also a good opportunity to test it (and fix it if necessary) on Jython; I don't think that has been done before!

Best,

Nico

tucotuco commented 8 years ago

I am using python-dwca-reader actively, but the Jython context does not have the same urgency as just using the Readers. I thought about forking the repository and making a version that had BeautifulSoup optional, but it would probably take me longer than next week to get around to it. If you can do it in that same time frame, that is better. I will gladly test it as soon as it is ready.

niconoe commented 8 years ago

Cool, didn't know you were already using it, happy that my work is useful to others.

I had a quick look, and it does seem possible to make a version of python-dwca-reader that replaces BeautifulSoup/lxml with ElementTree from the standard library... If I'm not mistaken, it is also available in Jython, so we shouldn't be too far from having Jython compatibility... What do you think?
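For illustration, here is a minimal sketch of what reading a Darwin Core Archive descriptor (meta.xml) with only the standard-library ElementTree could look like. It is not the code used in python-dwca-reader: the sample XML, the read_core helper, and the returned dictionary layout are all assumptions made up for this example. It avoids lxml entirely and uses only calls that are also available in Python 2.7, which is the language level Jython targets.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down meta.xml content (assumption: not taken from a real archive).
META_XML = """<archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
  <core rowType="http://rs.tdwg.org/dwc/terms/Occurrence" ignoreHeaderLines="1">
    <files><location>occurrence.txt</location></files>
    <id index="0"/>
    <field index="1" term="http://rs.tdwg.org/dwc/terms/scientificName"/>
    <field index="2" term="http://rs.tdwg.org/dwc/terms/locality"/>
  </core>
</archive>"""

# ElementTree addresses namespaced tags as "{namespace-uri}localname".
NS = "{http://rs.tdwg.org/dwc/text/}"


def read_core(xml_text):
    """Return the core data file name, row type and field terms (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    core = root.find(NS + "core")
    location = core.find(NS + "files/" + NS + "location").text
    # Map each column index to the Darwin Core term it holds.
    fields = {}
    for field in core.findall(NS + "field"):
        fields[int(field.get("index"))] = field.get("term")
    return {
        "row_type": core.get("rowType"),
        "location": location,
        "fields": fields,
    }


info = read_core(META_XML)
print(info["location"])
print(info["fields"][1])
```

The main thing to watch when moving away from a more forgiving parser is the namespace: with ElementTree, every tag in a find or findall path has to carry the namespace prefix, as done with NS above.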

tucotuco commented 8 years ago

I think, "Excellent, go for it." Waiting anxiously.


niconoe commented 8 years ago

Hi John,

I just released a new version (0.7.0) that totally drops the dependency on BeautifulSoup and lxml. All the APIs that used to return BeautifulSoup objects now return xml.etree.ElementTree.Element (from the standard library). Could you have a look?
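For callers who previously navigated the returned BeautifulSoup objects, a rough sketch of the equivalent Element navigation follows. The EML fragment and the variable names are assumptions for illustration only; the Element is built here from a string, standing in for whatever Element the reader hands back.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata fragment (assumption: simplified, no namespaces).
EML = """<eml>
  <dataset>
    <title>Sample occurrences</title>
    <keywordSet>
      <keyword>Occurrence</keyword>
      <keyword>Observation</keyword>
    </keywordSet>
  </dataset>
</eml>"""

# Stand-in for an xml.etree.ElementTree.Element returned by the library.
metadata = ET.fromstring(EML)

# find() takes a path relative to the element and returns the first match (or None).
title = metadata.find("dataset/title").text

# iter() walks the whole subtree, which replaces a "find all tags named X" search.
keywords = [kw.text for kw in metadata.iter("keyword")]

print(title)
print(keywords)
```

Unlike tag-attribute navigation on a BeautifulSoup object, find() returns None when nothing matches, so code that reads .text on the result should check for None when the element is optional.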

I only checked very briefly, but it seems to work under Jython!

tucotuco commented 8 years ago

Confirmed that this works great under Jython and completely solves the issue for me. Closing. Thank you very much.