ContentMine / norma

Convert XML/SVG/PDF into normalised, sectioned, scholarly HTML
Apache License 2.0

Regex file selection duplicates OS globbing/regex functionality #67

Open ghost opened 7 years ago

ghost commented 7 years ago

This issue report is based on using this JAR.

The regex file selection functionality in norma adds complexity without adding value.

That is because this functionality duplicates what is already offered by shells on modern operating systems, such as GNU Bash or Windows PowerShell. Such shells already feature globbing- and regular expression-based file selection.

For example:

$ java -jar norma-0.5.0-SNAPSHOT-jar-with-dependencies.jar --project publicPapers  --fileFilter '.*/(.*).pdf' --makeProject '(\1)/fulltext.pdf'

could more naturally be expressed in Bash with a glob like:

$ java -jar norma-0.5.0-SNAPSHOT-jar-with-dependencies.jar --infiles publicPapers/*.pdf --makeProject 'fulltext.pdf'
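A minimal sketch of why the glob form needs no support from the program itself: the shell expands the pattern before the program starts, so the program receives one argument per matching file. The temporary directory and file names below are made up for illustration.

```shell
# Hypothetical demo directory; the file names are invented for illustration.
demo=$(mktemp -d)
touch "$demo/paperA.pdf" "$demo/paperB.pdf" "$demo/notes.txt"

# The shell replaces the pattern with the matching paths, one word per file.
# Here the expanded words become the positional parameters, just as they
# would become the arguments of a command such as java.
set -- "$demo"/*.pdf
glob_count=$#            # number of paths the glob expanded to
printf '%s\n' "$@"       # one matching path per line

rm -r "$demo"
```

Only the two `.pdf` files are passed on; `notes.txt` never reaches the command.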

If one's matching criteria require a regex rather than just a glob, this is also available with standard OS tools:

$ java -jar norma-0.5.0-SNAPSHOT-jar-with-dependencies.jar --infiles $(find publicPapers -maxdepth 1 -type f -iregex '.*/\(pub\|phm\).*\.pdf') --makeProject 'fulltext.pdf'
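One caveat when feeding `find` output to a command through `$(...)`: inside double quotes the whole output arrives as a single argument, and without quotes it is split on whitespace, which breaks file names containing spaces. A sketch of a more robust variant, assuming GNU find, uses `-exec ... {} +` so that each match is passed as its own argument. Here `printf` stands in for the real command, the demo file names are invented, and `[^/]*` replaces `.*` so that only the final path component is tested.

```shell
# Hypothetical demo directory; the file names are invented for illustration.
demo=$(mktemp -d)
touch "$demo/pub001.pdf" "$demo/PHM 002.pdf" "$demo/other.pdf" "$demo/pub003.txt"

# GNU find: -iregex matches the whole path case-insensitively and uses
# Emacs-style syntax by default, hence \( \| \) for grouping and alternation.
# -exec ... {} + hands every match to the command as a separate argument,
# so a name containing a space stays intact.
matched=$(find "$demo" -maxdepth 1 -type f \
    -iregex '.*/\(pub\|phm\)[^/]*\.pdf' \
    -exec printf '%s\n' {} + | wc -l)
echo "$matched"          # number of files selected by the regex

rm -r "$demo"
```

In a real invocation the `printf '%s\n'` part would be replaced by the command that should receive the files, with `{} +` placed last.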

Removing regex CLI functionality from norma would provide the following benefits:

- simpler documentation
- easier learning curve
- reduced code complexity
- easier maintenance

petermr commented 7 years ago

This is misconceived and unnecessary. This is a regex and not a glob, and the capture groups are used to rename parts of the tree.

ghost commented 7 years ago

@petermr wrote:

This is a regex and not a glob

Strictly speaking, that is true, and I have amended the wording accordingly. However, a glob would be adequate in many cases, including the invocations described at http://discuss.contentmine.org/t/extracting-data-from-tilburg-funnel-plot-diagrams/386 , and I have now illustrated that in my opening comment above.

and the capture groups are used to rename parts of the tree

In all the example norma invocations I have observed so far, this would be better handled as described in my opening comment.
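For illustration, the renaming itself can also be sketched with a plain shell loop. This assumes (not confirmed against norma itself) that `--fileFilter '.*/(.*).pdf' --makeProject '(\1)/fulltext.pdf'` has the effect of moving each `NAME.pdf` to `NAME/fulltext.pdf`; the directory and file names are made up.

```shell
# Hypothetical demo directory; the file names are invented for illustration.
demo=$(mktemp -d)
touch "$demo/paperA.pdf" "$demo/paperB.pdf"

for pdf in "$demo"/*.pdf; do
    [ -e "$pdf" ] || continue        # no match: the pattern stays literal
    name=$(basename "$pdf" .pdf)     # plays the role of capture group \1
    mkdir -p "$demo/$name"
    mv "$pdf" "$demo/$name/fulltext.pdf"
done

# Count the per-paper directories that now hold a fulltext.pdf.
set -- "$demo"/*/fulltext.pdf
moved=$#
printf '%s\n' "$@"

rm -r "$demo"
```

The `basename` call does the work of the capture group, so no regex is involved in this simple case.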

This is misconceived and unnecessary.

Surely these benefits, stated in my opening comment, are neither misconceived, nor undesirable:

- simpler documentation
- easier learning curve
- reduced code complexity
- easier maintenance

Therefore re-opening, as unresolved.

ghost commented 7 years ago

Related: https://github.com/ContentMine/cproject/issues/3