bio-guoda / preston

a biodiversity dataset tracker
MIT License

Explore analyzing Preston text corpora with text pattern analysis tools like Voyant Tools #112

Closed jhpoelen closed 3 years ago

jhpoelen commented 3 years ago

@debpaul pointed out https://voyant-tools.org as a tool that the humanities use to analyze texts.

According to https://voyant-tools.org/docs/#!/guide/about-section-software-libraries, the web tool uses the following libraries (mostly Java, plus some JavaScript):

    Apache PDFBox for reading PDF documents
    Apache POI for reading Microsoft Office documents
    Apache Commons Math, Collections, File Upload, IO, Compress
    CyberNeko HTML Parser for reading (less than valid) HTML
    JAMA: Java Matrix Package for principal component and correspondence analysis in ScatterPlot
    MAchine Learning for LanguagE Toolkit (MALLET), especially for topic clustering
    Oracle Berkeley DB Java Edition for data storage
    Stanford Core Natural Language Processing, especially for named entity recognition in RezoViz
    XStream used to produce XML or JSON results
    Google Closure Compiler to compress Javascript files
    jQuery another Javascript framework used by some tools
    Sencha EXT JS the main Javascript framework used

jhpoelen commented 3 years ago

As far as I understand, Preston can be used in combination with the tools mentioned above to reliably access data of known provenance (that is, data together with their lineage).
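
To make the idea concrete, here is a minimal sketch of what "data of known provenance" could look like before handing text to an analysis tool. It is not Preston's implementation: the `hash://sha256/` identifier form, the helper names (`content_id`, `verified_text`, `term_frequencies`) and the sample corpus are all assumptions for illustration, and the word count is only a crude stand-in for a tool like Voyant.

```python
import hashlib
import re
from collections import Counter


def content_id(data: bytes) -> str:
    """Return a content identifier for the bytes (assumed hash://sha256/ form)."""
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()


def verified_text(data: bytes, expected_id: str) -> str:
    """Decode the bytes as UTF-8 only if their content id matches the expected one."""
    actual = content_id(data)
    if actual != expected_id:
        raise ValueError(f"content mismatch: expected {expected_id}, got {actual}")
    return data.decode("utf-8")


def term_frequencies(text: str) -> Counter:
    """Count lowercase ASCII words; a crude stand-in for a text analysis tool."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Hypothetical corpus snippet; in practice the bytes would come from a tracked dataset.
corpus = b"Apis mellifera visits flowers. Apis mellifera makes honey."
# Stands in for an identifier recorded earlier in a provenance log.
recorded_id = content_id(corpus)

# Analyze the text only after confirming it matches the recorded identifier.
freqs = term_frequencies(verified_text(corpus, recorded_id))
print(freqs.most_common(2))
```

The point of the check is that the analysis result can be tied to an exact version of the text: if the bytes change, the identifier no longer matches and the analysis does not run.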