ajenhl / tacl

Tool for performing basic text analysis on the CBETA corpus
GNU General Public License v3.0

step by step #61

Open Zuckonit opened 4 years ago

Zuckonit commented 4 years ago

I have read the docs but still run into problems. I have 20 CBETA XML files (with 20 different labels, say 1 to 20), and I want to produce a diff result for them. Could you please provide a step-by-step tutorial for this?

ajenhl commented 4 years ago

Sure. Here are the steps, assuming that the CBETA XML files are in a directory called source_dir, that you want 1-6-grams, and that the catalogue is called catalogue.txt:

  1. Create the corpus from the XML (two commands, run in this order):
       tacl prepare source_dir xml_dir
       tacl strip xml_dir corpus_dir

  2. Create the database: tacl ngrams cbeta.db corpus_dir 1 6

  3. Run the diff: tacl diff cbeta.db corpus_dir catalogue.txt > diff-results.csv
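For convenience, the steps above can be chained in a small shell function. This is only a sketch: the function name run_tacl_diff is made up for illustration, the command lines are copied as-is from the steps, and your tacl version may need extra options (check tacl --help).

```shell
# Sketch only: run_tacl_diff is a hypothetical helper, not part of tacl.
# Arguments are the paths named in the steps above.
run_tacl_diff() {
    source_dir=$1   # raw CBETA XML
    xml_dir=$2      # written by tacl prepare
    corpus_dir=$3   # written by tacl strip
    database=$4     # written by tacl ngrams
    catalogue=$5    # must already exist and have content
    results=$6      # CSV written by tacl diff

    if [ ! -f "$catalogue" ]; then
        echo "missing catalogue: $catalogue" >&2
        return 1
    fi

    # Each step runs only if the previous one succeeded.
    tacl prepare "$source_dir" "$xml_dir" &&
    tacl strip "$xml_dir" "$corpus_dir" &&
    tacl ngrams "$database" "$corpus_dir" 1 6 &&
    tacl diff "$database" "$corpus_dir" "$catalogue" > "$results"
}

# Run only when tacl is installed and the source directory exists.
if command -v tacl >/dev/null 2>&1 && [ -d source_dir ]; then
    run_tacl_diff source_dir xml_dir corpus_dir cbeta.db catalogue.txt diff-results.csv
fi
```

Stopping at the first failing step means a broken prepare or strip run does not leave you with a misleading, empty results file.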

Does this help?

Zuckonit commented 4 years ago

What about corpus_dir? What does it contain, and how can I make one?

ajenhl commented 4 years ago

corpus_dir is created by tacl strip - it takes the files in xml_dir (itself created as the output of tacl prepare) and outputs the stripped versions of them in whatever you specify as corpus_dir.

In my example, xml_dir, corpus_dir, catalogue.txt, cbeta.db, and diff-results.csv are all paths that you specify. Only in the case of catalogue.txt do you need to have any content there before running those commands in that sequence.
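Since catalogue.txt is the one file you have to write yourself, here is a rough way to generate it. It rests on two assumptions you should check against the tacl docs: that corpus_dir holds one entry per work (a directory, or a .txt file in older layouts), and that each catalogue line is a work name, a space, and a label. The function name make_catalogue is made up for this example.

```shell
# Sketch only: make_catalogue is a hypothetical helper, not part of tacl.
# Assumes one entry per work in the corpus directory and a
# "work-name label" line format for the catalogue.
make_catalogue() {
    corpus=$1   # directory produced by tacl strip
    out=$2      # catalogue file to write
    n=0
    : > "$out"  # start with an empty file
    for entry in "$corpus"/*; do
        [ -e "$entry" ] || continue          # skip the unmatched glob
        n=$((n + 1))
        work=$(basename "$entry" .txt)       # drop a .txt suffix if present
        printf '%s %s\n' "$work" "$n" >> "$out"
    done
}

# Give every work its own numeric label, 1 to N.
if [ -d corpus_dir ]; then
    make_catalogue corpus_dir catalogue.txt
fi
```

With 20 works this yields labels 1 to 20, one per work, in the shell's glob order. Edit the labels by hand if you want several works to share one.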

Faxinrepent commented 4 years ago

It is helpful! Could you please explain how to manipulate the results (with tacl results/align/highlight) in this case? I had trouble with them, as shown in the attached image, even though pandas, biopython, etc. are all installed. Thanks a lot for writing and sharing this software! [Attached image: error]

ajenhl commented 4 years ago

So in that case, as per the last line of the error text, there is no results file diff-result.csv in that directory, so it is unable to manipulate those results. Presumably the results are either in a file with a different name, or in a different directory, or both.
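One quick way to check this before running any of the results commands is to test for the file and list the CSV files next to it. A minimal sketch follows; the function name find_results_file is made up for illustration. Note that the steps earlier in this thread wrote diff-results.csv, while the error mentions diff-result.csv.

```shell
# Sketch only: find_results_file is a hypothetical helper, not part of tacl.
# Reports whether the results file exists; if not, lists CSV files in
# the same directory so a misspelled name is easy to spot.
find_results_file() {
    wanted=$1
    if [ -f "$wanted" ]; then
        echo "found: $wanted"
        return 0
    fi
    echo "not found: $wanted" >&2
    dir=$(dirname "$wanted")
    for candidate in "$dir"/*.csv; do
        if [ -e "$candidate" ]; then
            echo "candidate: $candidate"
        fi
    done
    return 1
}
```

For example, find_results_file diff-result.csv would list diff-results.csv as a candidate if that is the name the diff was actually written to.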