Tool for scraping crop data from practicalplants wiki

petteripitkanen commented 5 years ago

Create a tool for scraping crop data from practicalplants wiki. Try to divide it into functions that could also be useful for #80. The output format should be like practicalplants.json, or a symmetrical plain JS file (#70).

Currently practicalplants.json is missing some data for the properties edible part and use, medicinal part and use, and material part and use. The tool should handle these correctly.

The tool should be written in JS but it can be based on the Python tool that was originally used for generating practicalplants.json.

l0n3star commented 5 years ago

@petteripitkanen I'll take this one.

l0n3star commented 5 years ago

@petteripitkanen I'm about halfway there. I do have one question. I'm going to dump the data into a database. I'm debating between MongoDB and Postgres. Any thoughts? One plus point with Postgres is that I can run SQL queries against it. I also don't see an issue with devising a schema.

petteripitkanen commented 5 years ago

The main output format should be a plain JS file, the one that is currently used. I don't see the benefit of dumping the data to a database, but having multiple output formats seems okay to me.

l0n3star commented 5 years ago

@petteripitkanen I have the tool ready for use. Note it only does JSON but I'm now adding support for other formats. I just wanted you to get a first look. Any and all feedback welcome. I tried it and the file size is 23M (current one is 17M). I spot checked a few plants and it has all uses. My repo is here: https://github.com/l0n3star/scraper

petteripitkanen commented 5 years ago

Thanks, I have tried it and it is looking good. These are my comments for the moment; they are mostly about practical problems that I found, as I haven't yet gone deeply into the code. I think eventually it would be good to integrate this into the powerplant code base. Also, when the static crop data is updated, it is necessary to produce some sort of diff to see easily that there are no regressions (hopefully git diff is clear enough when the crops are sorted into the same order).

If processPlantContent fails for a crop then this crop is not included in the output. There can easily be connection problems during a run, so a retry mechanism is needed.
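It looks like some crops don't include the property binomial, for example Rosmarinus officinalis (https://practicalplants.org/w/api.php?action=query&prop=revisions&rvlimit=1&rvprop=content&format=json&titles=Rosmarinus%20officinalis), but then title seems to contain the binomial name. There could be logic for this: if binomial is not included, fill it with data from title?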
Property functions is always a string while it should be an array of objects that have the function property.
It looks like forage should also be an array, currently it is a string, and sometimes it contains }} at the end.
It would be good to double-check if other known array properties are parsed correctly.
It would be good to double-check if there are more properties that are missing data when compared to raw MediaWiki crop data, maybe there are other properties that we have partly missed before?
It'd be clearer to have one async function that fetches the crops and returns an array of JS objects, this way the output step could be separated from the fetching step, and it would be easier to add support for different output formats.
It'd be ok to use a MediaWiki parser library if there is one available that seems suitable, though the parsing logic doesn't seem to be that complex, so I think it is fine to go with fixing the hand-written parser as well.
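
The retry, binomial fallback, and fetch/output separation points above could be sketched roughly as follows. This is only an illustration under assumptions: withRetry, fetchCrops, toJson, and the fetchCrop callback are hypothetical names, not functions from the scraper repo, and fetchCrop stands in for whatever downloads and parses a single page.

```javascript
// Hypothetical helper: retry an async operation a few times before giving up.
async function withRetry(operation, { attempts = 3, delayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < attempts) {
        // Linear backoff: wait a bit longer after each failed attempt.
        await new Promise((resolve) => setTimeout(resolve, delayMs * attempt));
      }
    }
  }
  throw lastError;
}

// Fetching step: returns an array of plain JS objects and writes nothing.
// fetchCrop(title) is an assumed callback that downloads and parses one page.
async function fetchCrops(titles, fetchCrop, retryOptions) {
  const crops = [];
  for (const title of titles) {
    const crop = await withRetry(() => fetchCrop(title), retryOptions);
    // Fall back to the page title when the binomial property is missing.
    crops.push({ ...crop, binomial: crop.binomial || title });
  }
  // Sort so that successive runs produce output that diffs cleanly.
  crops.sort((a, b) => a.binomial.localeCompare(b.binomial));
  return crops;
}

// Output step: one of possibly several formats, kept apart from fetching.
function toJson(crops) {
  return JSON.stringify(crops, null, 2);
}
```

With this split, adding another output format only means adding another function next to toJson; the fetching code stays untouched.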

l0n3star commented 5 years ago

Thanks for the great feedback!

petteripitkanen commented 5 years ago

I did some related debugging in #97, it seems that raw (practicalplants.org) MediaWiki data has two types of properties: strings and arrays of objects. Though sometimes the strings are actually CSV-encoded arrays.

It looks preferable that the parser understands only these two types (raw strings, arrays of objects), and the conversion from CSV to array is done in another step.

Currently all properties that are arrays of objects are incomplete, so for all these properties there is potentially data missing:

edible part and use
medicinal part and use
material part and use
toxic parts
functions
shelter
forage
crops
subspecies
cultivar groups
ungrouped cultivars
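
A minimal sketch of the separate CSV conversion step described above, assuming the parser has already produced only raw strings and arrays of objects. The names CSV_PROPERTIES, splitCsv, and convertCsvProperties are hypothetical, and the list of CSV-encoded properties is a guess that would need checking against the wiki data.

```javascript
// Assumed list of string properties that hold comma-separated values;
// the real list would have to be checked against the wiki data.
const CSV_PROPERTIES = ['soil texture', 'soil ph', 'grow from'];

// Split a comma-separated string into trimmed, non-empty items.
function splitCsv(value) {
  return value
    .split(',')
    .map((item) => item.trim())
    .filter((item) => item.length > 0);
}

// Second step: takes parser output (strings and arrays of objects only)
// and returns a copy in which the known CSV strings are arrays of strings.
function convertCsvProperties(crop, csvProperties = CSV_PROPERTIES) {
  const converted = { ...crop };
  for (const name of csvProperties) {
    if (typeof converted[name] === 'string') {
      converted[name] = splitCsv(converted[name]);
    }
  }
  return converted;
}
```

Keeping this out of the parser means the parser never has to guess whether a comma inside a string is a separator or part of the value.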

l0n3star commented 5 years ago

Thank you. I will take a look.

petteripitkanen commented 5 years ago

I have taken a look at available JS parsers, for instance infobox-parser and wtf_wikipedia (this is Wikipedia-specific but it is possible to trick the parser by changing the template name Plant to Infobox element), and it seems that none of these handle nested templates properly.

Actually tackling the general case even for a single template expression seems to require a full wikitext parser, and basic wikitext is not the easiest to parse to start with, and then it is interleaved with HTML and whatnot, so the task of parsing wikitext to a data structure is quite complex. MediaWiki itself doesn't seem to have such a parser, only code to convert wikitext to HTML for display.

Our case is a bit easier, as the practicalplants.org data for the Plant template is quite regular, with only a small amount of HTML and other irregularities. I have preliminary plans for powerplant to be able to use practicalplants.org for automatically populating a local MediaWiki instance, and then synchronizing the local MediaWiki with powerplant, allowing one to edit and browse the crop collection with MediaWiki. For this I'd like to have a special practicalplants wikitext parser in powerplant.

For this the way to go could be to:

Extend the PracticalplantsCrop dataset in db/practicalplants-data.js to also contain the raw wikitext of the Plant template. This could be done in one PR.
Start writing a parser and test it using the raw wikitext data. There are already some tests written in #97. The parser could start incomplete and be iteratively improved through multiple PRs.
Once the parser passes all tests, update db/practicalplants-data.js with the parsed objects.

l0n3star commented 5 years ago

I'll get started on adding raw wikitext.

petteripitkanen commented 5 years ago

The extended structure could look like this [{ wikitext: String, object: PracticalplantsCrop }, ...], and then there could be two functions in db/practicalplants-data.js, getCrops for getting the parsed objects and getWikitextObjectPairs for getting the whole data structure. With this structure it would be easy to compare wikitext and parsed objects in a diff.
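
As a rough illustration of that structure, the two accessors might look like this. The single entry below is stand-in data, not the real contents of db/practicalplants-data.js, and how the file exports the functions is left open.

```javascript
// Stand-in data; in the real file this array would hold every crop.
const wikitextObjectPairs = [
  {
    wikitext: '{{Plant\n|common=Rosemary\n|family=Lamiaceae\n}}',
    object: { common: 'Rosemary', family: 'Lamiaceae' },
  },
];

// Returns the whole structure, useful for comparing wikitext and objects.
function getWikitextObjectPairs() {
  return wikitextObjectPairs;
}

// Returns only the parsed PracticalplantsCrop objects.
function getCrops() {
  return wikitextObjectPairs.map((pair) => pair.object);
}
```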

petteripitkanen commented 5 years ago

While there doesn't seem to be a JS parser available that generates a complete AST of the nested template structure, it might be useful to use the partial parses and fill in the details, to significantly ease the remaining parsing process.

Input:

{{Plant
|common=Rosemary
|family=Lamiaceae
|primary image=Rosmarinus officinalis.jpg
|forage={{Plant provides forage for|forage=Bees}}
|edible part and use={{Has part with edible use
|part used=Leaves
|part used for=Herbs
}}{{Has part with edible use
|part used=Leaves
|part used for=Dried
}}{{Has part with edible use
|part used=Flowers
|part used for=Salads
}}
|material part and use=
|medicinal part and use=
|sun=full sun
|shade=no shade
|hardiness zone=7
|heat zone=
|water=low
|drought=tolerant
|soil water retention=well drained, moist
|soil texture=sandy, loamy
|soil ph=acid, neutral, alkaline
|wind=No
|maritime=Yes
|native range=South Europe, West Asia
|ecosystem niche=Shrub
|life cycle=perennial
|herbaceous or woody=woody
|deciduous or evergreen=evergreen
|
|fertility=self fertile
|mature measurement unit=metres
|mature height=1.2
|mature width=1.2
|flower colour=blue
|grow from=seed, cutting
|seed requires stratification=No
|seed dormancy depth=
|seed requires scarification=No
|seed requires smokification=No
|cutting type=semi-ripe
|bulb type=
|graft rootstock=
|edible parts=flowers, leaves
|edible uses=Herb, Salad, Dry
}}

Output of infobox-parser(input):

{ general:
   { common: 'Rosemary',
     family: 'Lamiaceae',
     primaryImage: 'Rosmarinus officinalis.jpg',
     forage: 'Plant provides forage for',
     ediblePartAndUse: 'Has part with edible use',
     partUsed: 'Flowers',
     partUsedFor: 'Salads',
     sun: 'full sun',
     shade: 'no shade',
     hardinessZone: '7',
     water: 'low',
     drought: 'tolerant',
     soilWaterRetention: 'well drained, moist',
     soilTexture: 'sandy, loamy',
     soilPh: 'acid, neutral, alkaline',
     wind: 'No',
     maritime: 'Yes',
     nativeRange: 'South Europe, West Asia',
     ecosystemNiche: 'Shrub',
     lifeCycle: 'perennial',
     herbaceousOrWoody: 'woody',
     deciduousOrEvergreen: 'evergreen',
     fertility: 'self fertile',
     matureMeasurementUnit: 'metres',
     matureHeight: '1.2',
     matureWidth: '1.2',
     flowerColour: 'blue',
     growFrom: 'seed, cutting',
     seedRequiresStratification: 'No',
     seedRequiresScarification: 'No',
     seedRequiresSmokification: 'No',
     cuttingType: 'semi-ripe',
     edibleParts: 'flowers, leaves',
     edibleUses: 'Herb, Salad, Dry' },
  tables: [],
  lists: [] }

Output of wtf_wikipedia(input).templates():

[ { forage: 'Bees', template: 'plant provides forage for' },
  { 'part used': 'Leaves',
    'part used for': 'Herbs',
    template: 'has part with edible use' },
  { 'part used': 'Leaves',
    'part used for': 'Dried',
    template: 'has part with edible use' },
  { 'part used': 'Flowers',
    'part used for': 'Salads',
    template: 'has part with edible use' },
  { common: 'Rosemary',
    family: 'Lamiaceae',
    'primary image': 'Rosmarinus officinalis.jpg',
    sun: 'full sun',
    shade: 'no shade',
    'hardiness zone': '7',
    water: 'low',
    drought: 'tolerant',
    'soil water retention': 'well drained, moist',
    'soil texture': 'sandy, loamy',
    'soil ph': 'acid, neutral, alkaline',
    wind: 'No',
    maritime: 'Yes',
    'native range': 'South Europe, West Asia',
    'ecosystem niche': 'Shrub',
    'life cycle': 'perennial',
    'herbaceous or woody': 'woody',
    'deciduous or evergreen': 'evergreen',
    list: [ '' ],
    fertility: 'self fertile',
    'mature measurement unit': 'metres',
    'mature height': '1.2',
    'mature width': '1.2',
    'flower colour': 'blue',
    'grow from': 'seed, cutting',
    'seed requires stratification': 'No',
    'seed requires scarification': 'No',
    'seed requires smokification': 'No',
    'cutting type': 'semi-ripe',
    'edible parts': 'flowers, leaves',
    'edible uses': 'Herb, Salad, Dry',
    template: 'plant' } ]
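
One way the templates() output above could be folded back into a single crop object is sketched below. It works on the plain array shown above and does not call wtf_wikipedia itself. The TEMPLATE_TO_PROPERTY mapping and the mergePartialParse name are assumptions: the partial parse does not record which Plant property a nested template belonged to, so that link has to be supplied by hand, and it breaks down if the same nested template can appear under two different properties.

```javascript
// Assumed mapping from nested template name to the Plant property it fills.
const TEMPLATE_TO_PROPERTY = {
  'plant provides forage for': 'forage',
  'has part with edible use': 'edible part and use',
};

function mergePartialParse(templates, templateToProperty = TEMPLATE_TO_PROPERTY) {
  const plant = templates.find((entry) => entry.template === 'plant');
  if (!plant) {
    throw new Error('No Plant template found');
  }
  // Copy the top-level properties, dropping the template name and the
  // stray "list" entry that the lone "|" line in the input produces.
  const { template, list, ...crop } = plant;
  for (const nested of templates) {
    const property = templateToProperty[nested.template];
    if (!property) {
      continue; // the Plant template itself, or an unmapped template
    }
    const { template: ignored, ...fields } = nested;
    crop[property] = crop[property] || [];
    crop[property].push(fields);
  }
  return crop;
}
```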

l0n3star commented 5 years ago

Good idea to use partial parses. I will write the tests before completing the parser. This way I'll have clarity.

petteripitkanen commented 5 years ago

I could do the parser, it doesn't seem to take that many lines to write a recursive descent parser that produces an AST with the limitation of accepting only inputs where the tokens {{ and }} can be part of template expressions (and not within HTML constructs).

For now I'd probably accept a PR that extends practicalplants.js to include raw wikitext (as explained in a previous comment). Your tool could be useful for fetching the raw wikitexts for this PR.
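
For reference, a recursive descent parser of the kind mentioned above might look roughly like the sketch below. It is not the parser from powerplant or #97; parseTemplate, toObject, and the AST shape are made up for illustration. It assumes {{ and }} only ever delimit template expressions, and it ignores HTML, links, and comments.

```javascript
// Parses one template expression into an AST:
// { type: 'template', name, params: [{ name, value: [string | template] }] }
function parseTemplate(input) {
  const source = input.trim();
  let pos = 0;

  const startsWith = (token) => source.startsWith(token, pos);

  function expect(token) {
    if (!startsWith(token)) {
      throw new SyntaxError(`Expected "${token}" at offset ${pos}`);
    }
    pos += token.length;
  }

  // Reads characters up to (not including) the next stop token.
  function readText(stops) {
    const start = pos;
    while (pos < source.length && !stops.some((stop) => startsWith(stop))) {
      pos++;
    }
    return source.slice(start, pos);
  }

  // value := (text | template)*
  function parseValue() {
    const parts = [];
    while (pos < source.length && !startsWith('|') && !startsWith('}}')) {
      if (startsWith('{{')) {
        parts.push(parseTemplateNode());
      } else {
        const text = readText(['{{', '}}', '|']).trim();
        if (text !== '') parts.push(text);
      }
    }
    return parts;
  }

  // template := '{{' name ('|' [key '='] value)* '}}'
  function parseTemplateNode() {
    expect('{{');
    const name = readText(['|', '}}']).trim();
    const params = [];
    while (startsWith('|')) {
      pos++;
      const checkpoint = pos;
      const key = readText(['=', '|', '{{', '}}']);
      if (startsWith('=')) {
        pos++;
        params.push({ name: key.trim(), value: parseValue() });
      } else {
        // No "=" found: rewind and treat it as an unnamed parameter.
        pos = checkpoint;
        params.push({ name: null, value: parseValue() });
      }
    }
    expect('}}');
    return { type: 'template', name, params };
  }

  const node = parseTemplateNode();
  if (pos !== source.length) {
    throw new SyntaxError(`Unexpected input at offset ${pos}`);
  }
  return node;
}

// Flattens the AST into the two property types: strings and arrays of objects.
function toObject(node) {
  const result = {};
  for (const { name, value } of node.params) {
    if (name === null) continue; // skip unnamed parameters such as a lone "|"
    const templates = value.filter((part) => typeof part !== 'string');
    result[name] = templates.length > 0
      ? templates.map((template) => toObject(template))
      : value.join('');
  }
  return result;
}
```

Calling toObject(parseTemplate(wikitext)) on the Rosemary input from the earlier comment should give forage and edible part and use as arrays of objects and the remaining properties as strings, though this has only been reasoned through on small inputs.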

l0n3star commented 5 years ago

Sounds fair. I might even pick up the dragon book to understand more on compiler design :)

l0n3star commented 5 years ago

I found a parsing library called chevrotain. It lets you define a grammar and generates an AST for you. Mind if I try this out or do you still prefer to write your own parser?

petteripitkanen commented 5 years ago

I have removed the "good first issue" label from all issues since I feel that none of them are defined clearly enough to be done by a newcomer who by definition doesn't have an overall view of the project. As we are currently in the process of defining this project more clearly, the conditions are not easy for small contributions. Once the development gets more stable I'll perhaps also have a better view of tasks that would be good for newcomers.

You are welcome to continue exploring different ways to parse the practicalplants MediaWiki format (and powerplant in general), but please note that I also don't currently have a complete picture of what powerplant should look like. If you continue giving me input in the form of comments and PRs, I will try to evaluate them eventually, but the evaluation is from the point of view of what would be good for powerplant overall. So it is likely that even if something works it won't get merged if it goes against the overall design, because otherwise it would get largely reverted in the next commit.

l0n3star commented 5 years ago

I understand. I think it makes sense for powerplant to be further developed then. Thank you for your extremely valuable feedback on my PRs.

Ecohackerfarm / powerplant

Tool for scraping crop data from practicalplants wiki #83