Open funderburkjim opened 9 years ago
@beenooyadav After writing the above, I noticed you had made a 'create_list' program --- You should find it easy to write snplist1 - it will be similar to your create_list.
@funderburkjim
After writing snplist1, the only problem is at lines 13-14 of snp04_list.
Where a c-tag spans two lines, no matter what I do, its second part starts on a new line:
<c>C.
AGALLOCHA</c>
To unmark the mismatched we need a list. In both cases, I think we need to generate lists - fully agree.
@gasyoun did you see snp04_list.txt?
@funderburkjim snplist1.py and snp04_list.txt done
@beenooyadav
Your output looks fine!
I have a couple of stylistic suggestions regarding the code snplist1.py -- You don't need to change snplist1.py, but you might apply these ideas in the next refinement, snplist2.py.
Given the way l was chosen at line 13, the if condition at line 18 ALWAYS succeeds, and the else condition at line 21 never succeeds. Thus, this if-then-else is not needed; just replace it with the write statements (lines 19, 20): w.write("%s\n" % l[i])
    for x in l:
        x = re.sub(r'\r\n', ' ', x)
        # etc
        w.write("%s\n" % x)
x = re.sub(r'[\r\n]+', '', x) # this should handle both lines 15, 16
These are all minor observations.
The hardest part, at least for me, was learning about the re.DOTALL flag needed in the re.findall.
Next issue will suggest a refinement of snplist1.
Review
With snp04.txt, we have proceeded as far as we can by just examining the form of the data (capital letter sequences). However, we know (from general knowledge) that there are still some flaws in the markup. As mentioned in the note addressed to Pawan in #11, there are
These two kinds of problems are the only ones I see currently.
In both cases, I think we need to generate lists. For the first case, we need to generate a list of the cases that need un-marking. For the second case, we need to generate a list of the cases that need to be merged.
snplist1
As the first step in generating these special purpose lists, I suggest a program to provide a nice listing of all marked phrases appearing in snp04.txt. In this case, our output file will not be another version of snp.txt, so instead of naming our program snp05.py and our output file snp05.txt, let's call the program snplist1.py and the output snp04_list.txt (since we will be applying the program to snp04.txt). So, the program invocation will be
We are wanting to identify the text within the 'c' tags. This is complicated by the fact that sometimes this text starts on one line and continues to the next.
I noticed in your snp05.py that you were thinking about the f.read() function, which slurps an entire file into a string. This is what we want to do first in snplist1.py: read all of snp04.txt into a string.
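A minimal sketch of this first step, assuming Python 3 and UTF-8 input (neither is stated in this thread); read_whole_file is a hypothetical helper name. Passing newline="" keeps any \r\n pairs in the string untouched, so later substitutions see the file exactly as stored.

```python
def read_whole_file(path):
    """Return the full contents of `path` as one string.

    Assumptions: the file is UTF-8 text; newline="" disables newline
    translation so "\r\n" is preserved rather than converted to "\n".
    """
    with open(path, encoding="utf-8", newline="") as f:
        return f.read()
```

Usage would be along the lines of text = read_whole_file("snp04.txt").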
The next thing is to find all c-tags (plus contents) in this string. One way to do this is with re.findall:
For XXXXX, use the flag that allows the dot ('.') to match end-of-line characters - see https://docs.python.org/2/library/re.html
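As a rough sketch of this step: assuming the tags look exactly like <c>...</c> with no attributes, a non-greedy pattern combined with re.DOTALL (the flag that lets '.' match line breaks) picks up tags that cross a line. The pattern and the find_c_tags name are illustrative assumptions, not the code from snplist1.py.

```python
import re

# Assumed tag shape: a literal <c>, any text, a literal </c>.
# ".*?" is non-greedy so each match stops at the nearest </c>.
C_TAG = r"<c>.*?</c>"

def find_c_tags(text):
    """Return every <c>...</c> element in `text`, in order of appearance."""
    # re.DOTALL lets "." match "\n", so a tag split over two lines is found.
    return re.findall(C_TAG, text, re.DOTALL)

sample = "x <c>ABC</c> y <c>C.\nAGALLOCHA</c> z"
print(find_c_tags(sample))        # ['<c>ABC</c>', '<c>C.\nAGALLOCHA</c>']
print(re.findall(C_TAG, sample))  # ['<c>ABC</c>'] (line-crossing tag missed)
```

The second call shows why the flag matters: without it, the tag that continues on the next line is silently skipped.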
Now, let the program loop through the parts, and write each part to the output file - but before writing, make a couple of adjustments to the part, needed so that the 'line-crossing' parts are more usable.
What the output file should now contain is a listing (one per line), of the c-tagged elements, in the order in which they occur in snp04.
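The loop-and-write step might look roughly like the following. The only adjustment sketched here is collapsing line breaks inside a part into a single space, which is an assumption based on the re.sub suggestions elsewhere in this thread; write_parts is a hypothetical name, and Python 3 with UTF-8 output is also assumed.

```python
import re

def write_parts(parts, out_path):
    """Write one c-tagged part per line to `out_path`."""
    with open(out_path, "w", encoding="utf-8") as w:
        for x in parts:
            # Assumed adjustment: join a part that crossed a line break,
            # replacing any run of CR/LF characters with one space.
            x = re.sub(r"[\r\n]+", " ", x)
            w.write("%s\n" % x)
```

With this, a part such as "<c>C.\nAGALLOCHA</c>" ends up on a single output line as "<c>C. AGALLOCHA</c>".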
Compare the first few lines of the output to snp04.txt, and be sure you've got all the cases.
When you're done, mention it here and sync to Github.
The next step will be to extend snplist1.py to snplist2.py, so that the output file is easier to use in terms of our two objectives. I'll write the program specification for snplist2 when you've finished snplist1.