Juris-M / zotero-odf-scan-plugin

RTF/ODF-Scan for Zotero add-on
https://zotero-odf-scan.github.io/zotero-odf-scan/

Convert to citations/bibliography rather than links #24

Open retorquere opened 5 years ago

retorquere commented 5 years ago

I'd like to add an option for the ODF scanner to use Zotero's embedded citeproc to create a finalized document. The aim is not to remove the existing functionality that creates a Zotero-compatible document, but to let me use Word Online plus ODF scan without needing Word to finalize the document. Would this be:

and if so, can you point me to the part of the code that does the current replacement?

adam3smith commented 5 years ago

It'd definitely be desirable, but it wouldn't be easy. Currently the thing that makes the tool uncomplicated is that it doesn't need to talk to Zotero at all during the scan. All the citation data is added when setting a citation style in LibreOffice. The scan just converts the markers to LO Reference Marks with Zotero format and Zotero item.uris that allow for updating.
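
To illustrate why no round trip to Zotero is needed at scan time, here is a minimal sketch of expanding the item reference at the end of a marker into an item URI by string manipulation alone. The function name `expandItemRef`, the `zg` group prefix, and the URI shapes are assumptions for illustration; none of this is copied from rtfScan.js.

```javascript
// Hypothetical helper: expand a compact item reference such as
// "zu:6204:P4KXGRZI" into a Zotero item URI.
// ASSUMPTION: "zu" means a user library, "zg" a group library, and the
// URI shapes below follow Zotero's conventions. Not taken from the plugin.
function expandItemRef(ref) {
  const m = /^(zu|zg):(\d+):([A-Z0-9]{8})$/.exec(ref.trim());
  if (!m) return null; // not a reference this sketch understands
  const [, kind, libraryId, itemKey] = m;
  const scope = kind === "zu" ? "users" : "groups";
  return `http://zotero.org/${scope}/${libraryId}/items/${itemKey}`;
}

console.log(expandItemRef("zu:6204:P4KXGRZI"));
```

Because the URI is derived purely from the marker text, the citation data itself can be filled in later, when a citation style is set in LibreOffice.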

The relevant function is here: https://github.com/Juris-M/zotero-odf-scan-plugin/blob/master/chrome/content/rtfScan.js#L271

Hope you like regular expressions ;)

retorquere commented 5 years ago

The talking to Zotero bit isn't really too hard. Is https://github.com/Juris-M/zotero-odf-scan-plugin/blob/master/chrome/content/rtfScan.js#L512 the central function that orchestrates the finding and replacing, and https://github.com/Juris-M/zotero-odf-scan-plugin/blob/master/chrome/content/rtfScan.js#L594 the part that does the actual replacements?

If I may ask, why use regexes when FF has XML/XPath functionality built in?

adam3smith commented 5 years ago

I don't think there's a strong reason to use regex over XML, except that Frank likes regex. (The original tool this is based on was in Python, I think, but that wouldn't have made using XML/XPath impossible.) It might actually end up being more stable given different interpretations of the ODF XML model, but the reverse is also possible. Certainly worth testing out. That looks right wrt the functions, yes.

fbennett commented 5 years ago

If I may ask, why use regexes when FF has XML/XPath functionality built in?

You're not the first to ask that question. :smiley: The code was originally rejected for inclusion in Zotero for exactly that reason. (Edit: Dan's third response in this thread on zotero-dev)

The problem is that the target string may be cross-nested with XML tags that capture a larger run of document text. Identifying the string and isolating it for replacement using XML methods would be very hard to do. It would also be slower to run (because you would need to iterate to the top of the XML hierarchy to determine that a given match attempt had failed). I offered that explanation at the time, and it didn't find favor, but that's the reason behind using regex there.

retorquere commented 5 years ago

Cross-nested? I thought XML was strictly hierarchical?

retorquere commented 5 years ago

(that link appears to want to search your mailbox -- I don't think I have access to that :smile:)

fbennett commented 5 years ago

Cross-nested? I thought XML was strictly hierarchical?

XML is, but the "scannable cites" are not an XML unit, so you get things like this:

<tag>blah<tag>. { See <tag>e.g. | Smith,</tag></tag> 2008 | | |zu:6204:P4KXGRZI}</tag>

Maybe there is an easy way to find the strings and adjust the tag structure to permit insertion of a well structured XML element at their location in DOM context, but it looked pretty daunting to me, and I gave up.
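
As a rough illustration of the regex route, the sketch below matches a marker directly in the raw markup, tolerating tags inside it, and then strips those tags to recover the fields. The helper name `findScannableCites`, the five field names, and the assumption that neither the marker nor any interleaved tag contains a brace are all hypothetical; this is not the code in rtfScan.js.

```javascript
// Hypothetical helper, not taken from rtfScan.js.
// ASSUMPTIONS: a marker has five "|"-separated fields, and neither the
// marker text nor any interleaved tag contains "{" or "}".
const MARKER_RE = /\{([^{}]*\|[^{}]*)\}/g;
const TAG_RE = /<[^>]*>/g;

function findScannableCites(xml) {
  const cites = [];
  for (const match of xml.matchAll(MARKER_RE)) {
    const raw = match[1];
    // Tags that cut across the marker; they must survive any replacement.
    const tags = raw.match(TAG_RE) || [];
    // Drop the tags, then split the remaining plain text into fields.
    const fields = raw.replace(TAG_RE, "").split("|").map((f) => f.trim());
    if (fields.length !== 5) continue; // not a scannable cite
    const [prefix, cite, locator, suffix, itemRef] = fields;
    cites.push({ prefix, cite, locator, suffix, itemRef, tags });
  }
  return cites;
}

const sample =
  "<tag>blah<tag>. { See <tag>e.g. | Smith,</tag></tag> 2008 | | |zu:6204:P4KXGRZI}</tag>";
console.log(findScannableCites(sample));
```

For the fragment above, the `tags` array holds one opening and two closing tags. A replacement would have to re-emit them, in order, around whatever it inserts, or the document would stop being well-formed.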

Didn't notice that the Google Groups links worked that way! Here's the relevant bit (from April 16, 2013):

Frank

If it's firm that regular expressions can't be used, this is probably off the table for mainstream. That approach started as a hack, with the intention of eventually refactoring the code to use an XML parser. But as I played with documents, I found that the string is often chopped up by tag nesting in the internal XML markup. You could probably identify them, but the code would probably be harder to follow than the regexp, and might require quite a few debugging iterations. It's probably not worth attempting.

Dan

Well, it's just the use of regular expressions to actually parse the XML that we object to. Can you not use XPaths to find the relevant nodes, and then do regexps on the textContent?
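
One way to picture that suggestion, and the bookkeeping it brings, is the DOM-free sketch below. A crude tokenizer stands in for XPath/DOM text-node traversal, the regex runs on the concatenated text, and each match is mapped back to the text segments it touches. The names `tokenize` and `locateMarkers` are hypothetical, and the tokenizer assumes no `<` or `>` inside attribute values.

```javascript
// Crude stand-in for walking DOM text nodes: split markup into tag and
// text tokens. ASSUMPTION: no "<" or ">" inside attribute values.
function tokenize(xml) {
  return xml
    .split(/(<[^>]*>)/)
    .filter((t) => t !== "")
    .map((t) => ({ type: t.startsWith("<") ? "tag" : "text", value: t }));
}

// Hypothetical helper: run the regex on the concatenated text only, then
// report which text segments each marker overlaps.
function locateMarkers(xml) {
  let offset = 0;
  const segments = tokenize(xml)
    .filter((t) => t.type === "text")
    .map((t, index) => {
      const seg = { index, start: offset, end: offset + t.value.length, value: t.value };
      offset = seg.end;
      return seg;
    });
  const textContent = segments.map((s) => s.value).join("");

  const results = [];
  for (const m of textContent.matchAll(/\{[^{}]*\|[^{}]*\}/g)) {
    const start = m.index;
    const end = start + m[0].length;
    const touched = segments
      .filter((s) => s.start < end && s.end > start)
      .map((s) => s.index);
    results.push({ marker: m[0], segments: touched });
  }
  return results;
}

const sample =
  "<tag>blah<tag>. { See <tag>e.g. | Smith,</tag></tag> 2008 | | |zu:6204:P4KXGRZI}</tag>";
console.log(locateMarkers(sample));
```

Finding the marker this way is straightforward. The difficulty Frank describes comes afterwards: the marker spans several text segments that sit at different depths, so replacing it with a single well-formed element means restructuring the tags around it.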

retorquere commented 5 years ago

<tag>blah<tag>. { See <tag>e.g. | Smith,</tag></tag> 2008 | | |zu:6204:P4KXGRZI}</tag>

Lord Cthulhu almighty, there's kids in the room, you can't just show things like this out in the open... alright, I see your point. The solution would be ugly in any case given this, and the regexen are arguably less ugly than the XML parsing would have been.

Wow.

fbennett commented 5 years ago

It does look like a plain string in the word processor, though, so by adding LibreOffice as a dependency ...

paultroop commented 2 years ago

I'm afraid I don't follow all of the technical discussion here, but is this issue related to the possibility of using ODF scan as something like a BibTeX-style referencing system? I'm looking at the idea of using LaTeX for writing, but all my research references are in ODF scan form. I was wondering if there is an easy way of converting them into something that LaTeX would recognise.