funderburkjim / Markup-Sanskrit-Names-of-Plants

Proposal for Project for one of Pawan Goyal's students

Ideas regarding the SNP capitalization problem #3

Open funderburkjim opened 9 years ago

funderburkjim commented 9 years ago

Here is an idea for a first program. Let's keep the Python programs in a folder pywork and write our output to a subfolder pywork/data. Keep a file 'notes.txt' which indicates how to run the programs and briefly explains what each program does.

python snp01.py ../snp.txt data/snp01.txt

This program reads the original copy of the digitization (snp.txt) and writes another version as data/snp01.txt. The input file should be read as utf8 and the output file written as utf8; the codecs module in the Python 2 standard library will help do this. snp01.py will only modify lines in the range 81-3313; the other lines will be copied to snp01.txt without change.
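Here is a minimal sketch of that copy loop, just to make the idea concrete. The names copy_with_markup and transform are made up for illustration; transform stands for whatever function ends up doing the markup of one line. It assumes line numbers are counted from 1.

```python
import codecs

# Line range (1-based, inclusive) that snp01.py is allowed to modify.
FIRST_LINE = 81
LAST_LINE = 3313


def copy_with_markup(filein, fileout, transform, first=FIRST_LINE, last=LAST_LINE):
    """Copy filein to fileout as utf8, applying transform only to lines first..last.

    transform is a hypothetical function taking one line (with its line ending)
    and returning the modified line.
    """
    with codecs.open(filein, "r", "utf-8") as fin:
        with codecs.open(fileout, "w", "utf-8") as fout:
            for lineno, line in enumerate(fin, start=1):
                if first <= lineno <= last:
                    line = transform(line)
                # Lines outside the range are written back unchanged.
                fout.write(line)
```

A call such as copy_with_markup('../snp.txt', 'data/snp01.txt', some_function) would then do the whole job, with some_function supplied separately.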

In our example CARPOPOGON PRURIENS is the scientific name of a certain plant. We want snp01.py to identify all sequences of capitalized words which represent scientific names of plants.

snp01.py will identify sequences of 1 or more capitalized words and enclose such sequences with markup, such as

<c>CARPOPOGON PRURIENS</c>

Let's also allow some punctuation in our sequences, such as (from line 94 of snp.txt):

<P>(1) a) Chopra: [<c>COMMIPHORA ROXBURGHII (ARN.) ENGL.</c>] = [C.

Let's start with punctuation left-paren, right-paren, and period.

Let's also restrict snp01.py to look for sequences just within each line of snp.txt. It is certain that a second better version will have to take care of sequences that begin in one line and end in the next, but let's skip this complication to start with.
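One possible way to do the within-line matching is a regular expression; the sketch below is only a starting point, not the required solution. It makes some assumptions that are mine: a 'capitalized word' is taken to mean a run of upper-case letters A-Z not touching any other letter or digit, a word may carry a leading '(' and a trailing '.' and ')', words in a sequence are separated by single spaces, and a capital letter directly after '<' (as in the line tags <P> and <H>) is skipped. The names WORD, SEQUENCE and mark_line are made up.

```python
import re

# One all-caps word, optionally wrapped in the allowed punctuation.
# (?<![A-Za-z0-9<]) : not in the middle of a word, and not right after '<' (tags like <P>)
# (?![A-Za-z0-9])   : the capitals must not run into lower-case letters or digits
WORD = r"\(?(?<![A-Za-z0-9<])[A-Z]+(?![A-Za-z0-9])\.?\)?"

# One or more such words separated by single spaces.
SEQUENCE = re.compile(WORD + r"(?: " + WORD + r")*")


def mark_line(line):
    """Return line with each all-caps sequence enclosed in <c>...</c>."""
    return SEQUENCE.sub(lambda m: "<c>" + m.group(0) + "</c>", line)
```

For example, mark_line applied to the text 'identifies it as CARPOPOGON PRURIENS; see' wraps only CARPOPOGON PRURIENS and leaves the semicolon outside the markup. Words like 'Chopra' or 'Vs4s' are not marked, because their capital letter is followed by lower-case letters.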

By saying sequences of 1 or more capitalized words, we can be sure that some unwanted words (not scientific names of plants) will be marked. A subsequent version (snp02.py or snp03.py, etc) will have to deal with this issue.

So, after completion, here's what I expect snp01.txt will look like for lines 90-99:

<>but Vs4s identifies it as <c>CARPOPOGON PRURIENS</c>; see: {%kapi-
<>kacchu1¤;%}
<P>(4) = {%devas4iri1s2a%} (Vs4s, unidentified).
<H>{%agaru%} (sometimes considered to be a synonym of {%aguru%})
<P>(1) a) Chopra: [<c>COMMIPHORA ROXBURGHII (ARN.) ENGL.</c>] = [<c>C.</c>
<><c>AGALLOCHA (WIGHT ET ARN.) ENGL.</c>] = <c>BALSAMODENDRUM</c>
<><c>ROXBURGHII ARN.</c> = <c>AMYRIS COMMIPHORA ROXB.</c>;
<P>b) <c>KB</c> 1, p. 528-529: <c>COMMIPHORA AGALLOCHA ENGL.</c> = <c>BALSA</c>-
<><c>MODENDRUM ROXBURGHIII ARN.</c> = <c>AMYRIS COMMIPHORA</c>
<><c>ROXB.</c>; the Index Kewensis disagress with Chopra and <c>KB</c>

Again, this is definitely NOT the ending markup, but doing the markup to this degree seems like a good first step.

funderburkjim commented 9 years ago

Benoo/Shubham - You're ready to write the first program, snp01.py as described above.

A procedural comment re using the Github clone of MSNP. Open your Github client, select the MSNP repository, and then use the 'Open in Explorer' option to open your local copy of the repository.

Make a new folder 'pywork' in the repository, and a new subfolder of 'pywork' called 'pywork/data'. Start a new python program 'snp01.py' in the pywork folder, and make it function like the description above.

When you think the program is done:

Then I'll take a look and we'll proceed.

Some questions for me may arise in this process. Just submit them as questions in this or another issue, and I'll respond.

BTW -

beenooyadav commented 9 years ago

@funderburkjim - we are using Python 2.7, with Notepad as our text editor. BTW, we are learning Python right now.

funderburkjim commented 9 years ago

There are a few things you need to change in snp01.py. First, there are some 'code organization' changes. These are somewhat arbitrary choices on my part, but I think they are satisfactory for this project.

python snp01.py ../snp.txt data/snp01.txt  

Then, do another Github commit

os.chdir("C:\Users\BEENOO\Markup-Sanskrit-Names-of-Plants")

It's usually not a good idea to 'hard code' absolute paths in a program. For instance, I can't use this program, since I don't have a 'C:\Users\BEENOO' directory on my computer. Also, someone on a Macintosh or on a Linux system would also not have this path.

import sys
filein = sys.argv[1]  # this will be '../snp.txt' in our example
fileout = sys.argv[2] # this will be data/snp01.txt in our example

Then change your 'f=' and 'w=' statements to use filein and fileout.
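If you want the program to complain politely when the two file names are missing, something like the following could work. This is only a sketch; the function name parse_args and the wording of the usage message are made up.

```python
import sys


def parse_args(argv):
    """Return (filein, fileout) from an argv-style list, or exit with a usage message."""
    # argv[0] is the program name, so two file names means a list of length 3.
    if len(argv) != 3:
        sys.exit("usage: python snp01.py <filein> <fileout>")
    return argv[1], argv[2]


# In snp01.py this would be called as:
#   filein, fileout = parse_args(sys.argv)
```

Because the paths come from the command line, the same program runs unchanged on Windows, Macintosh, or Linux.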

funderburkjim commented 9 years ago

One comment about text editors. While Windows notepad is ok for editing small text files like snp01.py, it is not so good for viewing large files like snp.txt or snp01.txt.

For instance, I just opened snp.txt in Notepad, and it is a mess, in part because the file has unix-style end-of-line characters (\n), which Notepad doesn't understand.

I usually use Emacs to view such files. But a simpler editor would be Notepad++ for Windows, and unless you already have another good text editor which shows snp.txt properly, I would suggest you download Notepad++. It has a nice GUI interface. I've just downloaded it myself, and it shows snp.txt just fine. You will need to be able to properly view snp.txt, snp01.txt, etc. as this project progresses, so you can 'see' what your programs have done.

Still another option, which many professional programmers like, is VIM. You might want to learn VIM (or Emacs) sometime, but there is a steep learning curve to both, and you don't really need either of these for this MSNP project.

gasyoun commented 9 years ago

Why <c>COMMIPHORA ROXBURGHII (ARN.) ENGL.</c> instead of just <c>COMMIPHORA ROXBURGHII</c>? Should not (ARN.) ENGL. be left out? Even for hyperlinking.

After reading "It's usually not a good idea to 'hard code' absolute paths in a program", I understand that you might be a good teacher as well. Too many talents.

funderburkjim commented 9 years ago

@gasyoun Regarding whether (ARN.) ENGL. should be left out.

I'm not clear on the significance of such 'secondary' designations. Do you have a reference that discusses this aspect of Botanical naming? If we better understood the botanical conventions, we could make a more informed decision about this detail of marking up SNP.

gasyoun commented 9 years ago

First let's find out if there is a list of abbreviations in the book.

funderburkjim commented 9 years ago

@beenooyadav

Your comment in notes/'how to run snp01.txt' is more complicated than it needs to be. I have prepared a file pywork/notes.txt with a shorter explanation.

There is an explanation 1a for using the 'cmd.exe' terminal program in Windows, and an explanation 1b for using the 'GitBash' terminal program in Windows. Notice there is a slight difference in the two explanations.

I assume you are using cmd.exe - is this right?

Is the explanation in 1a clear to you?

funderburkjim commented 9 years ago

As you may have noticed, there are many 'small details' that come into play when writing open software, due to the fact that people run programs with slightly different computer setups.

One other small detail I want to mention just in passing (we don't have to deal with it right now -- maybe later if there is time). This detail has to do with line-endings in text files.

I ran your program with the following command-line (for testing purposes):

python snp01.py ..\snp.txt data/snp01test.txt

Note that the output file is slightly renamed -- I didn't want to clobber your file (data\snp01.txt).

Then I did a file comparison, using the 'diff' command under GitBash:

diff data/snp01.txt data/snp01test.txt > temp

I expected there to be no difference (i.e., that the temp output file from diff would be empty). Then, I used the 'wc' (word-count) program available in GitBash to count the number of lines in the 'temp' file:

wc -l temp
9074 temp

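This means that there are 9074 lines in temp!
Now in snp.txt, there are 4536 lines (same for snp01.txt and snp01test.txt). Since 9074 = 2 x 4536 + 2, diff is listing each line twice (once from each file), plus two lines of its own bookkeeping. So, there is a difference in EVERY line, when comparing your file snp01.txt and my test run snp01test.txt.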

What's happening? Well, there is an option on 'diff' that lets the comparison IGNORE WHITE SPACE differences, and when that is done there is no difference:

diff -w data/snp01.txt data/snp01test.txt > temp
wc -l temp
0 temp

Further investigation (using Emacs to look at the first temp file) showed that the difference is an extra '\r' character at the end of every line in your snp01.txt.
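For anyone who wants to check this without Emacs, a small sketch like the one below counts the two kinds of line ending in a file. The function name count_line_endings is made up. One plausible cause (an assumption on my part, not something verified here) is that a file opened in text mode on Windows has each '\n' written out as '\r\n'.

```python
def count_line_endings(path):
    """Return (crlf, lf): the number of '\\r\\n' endings and of bare '\\n' endings."""
    # Read raw bytes so that Python does no line-ending translation of its own.
    with open(path, "rb") as f:
        data = f.read()
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf
    return crlf, lf
```

If every line of a file ends in '\r\n', the second number is 0; for a file with only unix-style endings, the first number is 0.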

Since this is not exactly an error, but is due to some small difference between the way your Python is running and the way my Python is running, I don't want to dwell on it for now.

We have more significant issues to deal with at the moment.

After writing this comment, I deleted all the 'temporary' files (temp, data/snp01test.txt) from my local copy of MSNP; so they do not need to clutter up the 'official' repository.

gasyoun commented 9 years ago

@funderburkjim learning the small steps is important, but sure we can leave it for now.