snplist1

funderburkjim commented 9 years ago

Review

With snp04.txt, we have proceeded as far as we can by just examining the form of the data (capital letter sequences). However, we know (from general knowledge) that there are still some flaws in the markup. As mentioned in the note addressed to Pawan in #11:

Some capital letter sequences are NOT part of botanical phrases, like 'HB' in line 82. We need to unmark these once we identify them.
Some marked phrases are parts of botanical phrases but need to be merged. Line 94 provides an example.

These two kinds of problems are the only ones I see currently.
In both cases, I think we need to generate lists. For the first case, we need to generate a list of the cases that need un-marking. For the second case, we need to generate a list of the cases that need to be merged.

As the first step in generating these special purpose lists, I suggest a program to provide a nice listing of all marked phrases appearing in snp04.txt. In this case, our output file will not be another version of snp.txt, so instead of naming our program snp05.py and our output file snp05.txt, let's call the program snplist1.py and the output snp04_list.txt (since we will be applying the program to snp04.txt). So, the program invocation will be

python snplist1.py data/snp04.txt data/snp04_list.txt

We want to identify the text within the 'c' tags. This is complicated by the fact that sometimes this text starts on one line and continues to the next.

I noticed in your snp05.py that you were thinking about the f.read() function, which slurps an entire file into a string. This is what we want to do first in snplist1.py: read all of snp04.txt into a string.

The next thing is to find all c-tags (plus contents) in this string. One way to do this is with re.findall:

parts = re.findall(r'<c>.*?</c>', {the variable containing the file}, flags=re.XXXXX)

For XXXXX, use the flag that allows the dot ('.') to match end-of-line characters - see https://docs.python.org/2/library/re.html

Now, let the program loop through the parts and write each part to the output file. Before writing, make a couple of adjustments to the part, so that the 'line-crossing' parts are more usable:

replace end of line characters (r'\r\n') with a space,
remove <>

What the output file should now contain is a listing (one per line), of the c-tagged elements, in the order in which they occur in snp04.
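
The steps above can be put together roughly as follows. This is only a minimal sketch, not the snplist1.py that was actually committed: it assumes Python 3, assumes the data file is UTF-8, and uses re.DOTALL for the flag. The names list_c_elements and main are made up for illustration. The sketch keeps the c-tags themselves in the output and only flattens the line breaks.

```python
import re
import sys


def list_c_elements(text):
    """Return every <c>...</c> element in text, each flattened to one line."""
    # re.DOTALL lets '.' match line breaks, so elements that start on one
    # line and end on the next are found too.
    parts = re.findall(r'<c>.*?</c>', text, flags=re.DOTALL)
    # Replace each line break (Windows or Unix style) with a space.
    return [re.sub(r'\r?\n', ' ', part) for part in parts]


def main(filein, fileout):
    # Assumption: the file is UTF-8. newline='' keeps '\r\n' untranslated.
    with open(filein, 'r', encoding='utf-8', newline='') as f:
        text = f.read()
    with open(fileout, 'w', encoding='utf-8') as w:
        for part in list_c_elements(text):
            w.write("%s\n" % part)


if __name__ == "__main__":
    # Usage: python snplist1.py data/snp04.txt data/snp04_list.txt
    if len(sys.argv) == 3:
        main(sys.argv[1], sys.argv[2])
```

Keeping the search in its own function makes it easy to try on a short test string before running it on the whole file.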

Compare the first few lines of the output to snp04.txt, and be sure you've got all the cases.

When you're done, mention it here and sync to Github.

The next step will be to extend snplist1.py to snplist2.py, so that the output file is easier to use in terms of our two objectives. I'll write the program specification for snplist2 when you've finished snplist1.

funderburkjim commented 9 years ago

@beenooyadav After writing the above, I noticed you had made a 'create_list' program --- You should find it easy to write snplist1 - it will be similar to your create_list.

beenooyadav commented 9 years ago

@funderburkjim After writing snplist1, the only problem is at lines 13-14 of snp04_list, where there is a two-line c-tag. No matter what I do, its second part starts on a new line: <c>C. AGALLOCHA</c>

gasyoun commented 9 years ago

To unmark the mismatched phrases we need a list. "In both cases, I think we need to generate lists" - fully agree.

beenooyadav commented 9 years ago

@gasyoun Did you see snp04_list.txt?

beenooyadav commented 9 years ago

@funderburkjim snplist1.py and snp04_list.txt done

funderburkjim commented 9 years ago

@beenooyadav
Your output looks fine!

I have a couple of stylistic suggestions regarding the code snplist1.py -- You don't need to change snplist1.py, but you might apply these ideas in the next refinement, snplist2.py.

The file has no comments - all programs need comments!
Module unicodedata is not used, so you don't need to import it at line 2.
Because of the way the list of tags l was chosen at line 13, the if condition at line 18 ALWAYS succeeds, and the else condition at line 21 never succeeds. Thus, this if then else is not needed, --- just replace it with the write statements (lines 19,20)
Regarding those write statements, I think it more aesthetically pleasing to combine both writes into one statement: w.write("%s\n" %l[i])
In the for loop, since the index 'i' is only needed in the form l[i] (assuming line 22 is already gone), then it would be clearer to use an iterator over the list 'l':

for x in l:
 x = re.sub(r'\r\n',' ',x)
 # etc
 w.write("%s\n" % x)

You could combine the regex-substitutions of lines 15 and 16 into one pattern (I probably led you slightly astray by using r'\r\n'):

  x = re.sub(r'[\r\n]+', ' ', x)  # this should handle both lines 15, 16
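
As a quick check of the combined pattern, here is a small sketch. The helper name flatten and the two sample strings are invented for illustration; they are not taken from snp04.txt.

```python
import re


def flatten(part):
    """Replace any run of carriage returns / line feeds with one space."""
    return re.sub(r'[\r\n]+', ' ', part)


# Hypothetical samples: the same element with Windows and Unix line endings.
windows_style = "<c>C.\r\nAGALLOCHA</c>"
unix_style = "<c>C.\nAGALLOCHA</c>"

print(flatten(windows_style))
print(flatten(unix_style))
```

Both calls give the same one-line result. Note that the '+' also turns several consecutive line breaks into a single space.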

These are all minor observations.

The hardest part, at least for me, was learning about the re.DOTALL flag needed in the re.findall.
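
To see what the flag changes, here is a short sketch on a made-up string (not real data from snp04.txt). Without re.DOTALL the dot stops at a line break, so the element that crosses a line is silently skipped.

```python
import re

# Hypothetical text: the first element crosses a line break, the second does not.
text = "see <c>FICUS\nINDICA</c> and <c>HB</c>"

without_flag = re.findall(r'<c>.*?</c>', text)
with_flag = re.findall(r'<c>.*?</c>', text, flags=re.DOTALL)

print(len(without_flag))  # only the single-line element is found
print(len(with_flag))     # both elements are found
```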

Next issue will suggest a refinement of snplist1.

funderburkjim / Markup-Sanskrit-Names-of-Plants

snplist1 - listing all the <c> elements #12
