funderburkjim / Markup-Sanskrit-Names-of-Plants

Proposal for Project for one of Pawan Goyal's students

snplist1 - listing all the <c> elements #12

Open funderburkjim opened 9 years ago

funderburkjim commented 9 years ago

Review

With snp04.txt, we have proceeded as far as we can by just examining the form of the data (capital letter sequences). However, we know (from general knowledge) that there are still some flaws in the markup. As mentioned in the note addressed to Pawan in #11, there are two kinds of remaining problems.

These two kinds of problems are the only ones I see currently.
In both cases, I think we need to generate lists. For the first case, we need to generate a list of the cases that need un-marking. For the second case, we need to generate a list of the cases that need to be merged.

snplist1

As the first step in generating these special purpose lists, I suggest a program to provide a nice listing of all marked phrases appearing in snp04.txt. In this case, our output file will not be another version of snp.txt, so instead of naming our program snp05.py and our output file snp05.txt, let's call the program snplist1.py and the output snp04_list.txt (since we will be applying the program to snp04.txt). So, the program invocation will be

python snplist1.py data/snp04.txt data/snp04_list.txt

We want to identify the text within the 'c' tags. This is complicated by the fact that sometimes this text starts on one line and continues to the next.

I noticed in your snp05.py that you were thinking about the f.read() function, which slurps an entire file into a string. This is what we want to do first in snplist1.py: read all of snp04.txt into a string.

The next thing is to find all c-tags (plus contents) in this string. One way to do this is with re.findall:

parts = re.findall(r'<c>.*?</c>', {the variable containing the file}, flags=re.XXXXX)

For XXXXX, use the flag that allows the dot ('.') to match end-of-line characters - see https://docs.python.org/2/library/re.html

Now, let the program loop through the parts and write each part to the output file - but before writing, make a couple of adjustments to the part, needed so the 'line-crossing' parts are more usable.

What the output file should now contain is a listing (one per line), of the c-tagged elements, in the order in which they occur in snp04.
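The steps above (read the whole file, find the c-tagged elements, adjust the line-crossing ones, write one per line) could be sketched roughly as follows. This is only an illustration, not the snplist1.py in the repository: the function names `list_c_elements` and `main` are made up here, the file is assumed to be UTF-8, and the only adjustment shown is an assumed one (each line break inside an element is collapsed into a single space).

```python
import io
import re


def list_c_elements(text):
    """Return every <c>...</c> element in text, in order of occurrence."""
    # re.DOTALL lets '.' match newlines, so elements that span two lines are found.
    parts = re.findall(r'<c>.*?</c>', text, flags=re.DOTALL)
    # Assumed adjustment: replace each line break (and any spaces around it)
    # with one space, so every element fits on a single output line.
    return [re.sub(r'\s*[\r\n]+\s*', ' ', part) for part in parts]


def main(inpath, outpath):
    # Assumption: the data file is UTF-8 encoded.
    with io.open(inpath, encoding='utf-8') as f:
        text = f.read()  # slurp the whole file into one string
    with io.open(outpath, 'w', encoding='utf-8') as out:
        for part in list_c_elements(text):
            out.write(part + u'\n')
```

Calling `main('data/snp04.txt', 'data/snp04_list.txt')` would then correspond to the invocation given earlier; wiring it to `sys.argv` is left out to keep the sketch short.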

Compare the first few lines of the output to snp04.txt, and be sure you've got all the cases.
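Besides comparing by eye, a quick count can help confirm that no case was missed. The sketch below assumes the c-tags in the file are well formed and not nested, so the number of `<c>` opening tags should equal the number of extracted parts; `missed_count` is a hypothetical helper name, not something from the repository.

```python
import re


def missed_count(text, parts):
    """Number of '<c>' opening tags in text that have no extracted part."""
    # Assumption: tags are well formed and never nested.
    return text.count('<c>') - len(parts)


# Made-up sample with one single-line and one line-crossing element.
sample = "x <c>AB</c> y <c>C.\nAGALLOCHA</c> z"
parts = re.findall(r'<c>.*?</c>', sample, flags=re.DOTALL)
print(missed_count(sample, parts))
```

A non-zero result would suggest that some elements (most likely line-crossing ones) are not being matched.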

When you're done, mention it here and sync to Github.

The next step will be to extend snplist1.py to snplist2.py, so that the output file is easier to use in terms of our two objectives. I'll write the program specification for snplist2 when you've finished snplist1.

funderburkjim commented 9 years ago

@beenooyadav After writing the above, I noticed you had made a 'create_list' program --- You should find it easy to write snplist1 - it will be similar to your create_list.

beenooyadav commented 9 years ago

@funderburkjim After writing snplist1, the only problem is at lines 13-14 of snp04_list, where there is a two-line c-tag: no matter what I do, its second part starts on a new line: <c>C. AGALLOCHA</c>

gasyoun commented 9 years ago

To unmark the mismatched we need a list. In both cases, I think we need to generate lists - fully agree.

beenooyadav commented 9 years ago

@gasyoun Did you see snp04_list.txt?

beenooyadav commented 9 years ago

@funderburkjim snplist1.py and snp04_list.txt done

funderburkjim commented 9 years ago

@beenooyadav
Your output looks fine!

I have a couple of stylistic suggestions regarding the code snplist1.py -- You don't need to change snplist1.py, but you might apply these ideas in the next refinement, snplist2.py.

for x in l:
 x = re.sub(r'\r\n',' ',x)
 # etc
 w.write("%s\n" % x)

x = re.sub(r'[\r\n]+','',x)  # this should handle both lines 15, 16

These are all minor observations.
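To illustrate the single-substitution idea: the character class `[\r\n]+` matches any run of carriage returns and line feeds, so one call covers both Windows and Unix line endings. The sketch below is only an example; `join_lines` and its `sep` parameter are hypothetical, and whether the replacement should be a space or an empty string depends on whether the text already has a space next to the line break.

```python
import re


def join_lines(part, sep=' '):
    """Replace every run of CR/LF characters in part with sep."""
    # One pattern handles '\r\n', '\n', and bare '\r' alike.
    return re.sub(r'[\r\n]+', sep, part)


print(join_lines('<c>C.\r\nAGALLOCHA</c>'))      # break replaced by a space
print(join_lines('<c>C. \nAGALLOCHA</c>', ''))   # break removed, existing space kept
```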

The hardest part, at least for me, was learning about the re.DOTALL flag needed in the re.findall.
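For anyone reading along, here is a small comparison of re.findall with and without re.DOTALL. The sample string is made up for the illustration. By default '.' does not match a newline, so an element that crosses a line break is silently skipped.

```python
import re

# Made-up sample: one element on a single line, one split across two lines.
sample = "<c>AB</c> text <c>C.\nAGALLOCHA</c>"

# Default behaviour: '.' stops at the newline, so the second element is missed.
default = re.findall(r'<c>.*?</c>', sample)

# With re.DOTALL, '.' also matches '\n', so both elements are found.
dotall = re.findall(r'<c>.*?</c>', sample, flags=re.DOTALL)

print(len(default), len(dotall))
```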

The next issue will suggest a refinement of snplist1.