Open rtega opened 6 years ago
I added the following line at line 143 of book.c:
if(strstr(result,"嗜む")) { printf("boef: %s %i %i\n",result,position->page,position->offset); }
which yields the following result:
boef: tashinamu <たしなむ【嗜む】> 30827 984
boef: たしなむ【嗜む】 <..> 138094 1506
boef: たしなむ【嗜む】 33548 130
boef: たしなむ【嗜む】 <..> 138094 1506
boef: 嗜む 38028 1326
boef: たしなむ【嗜む】 <..> 138094 1506
Basically, what's happening is that there are three headers in the dictionary that all refer to the same article. Only the last header is exported.
Basically, things go wrong in book_undupe(book); we need to be smarter about what we remove.
I would propose keeping the heading with the largest content when removing duplicates in book_undupe(book). I don't understand your code at first glance. Could you have a look at it?
I replaced the undupe code with this quicksort and removeDuplicates pass. The resulting file is a bit smaller, but it seems to work as it should.

```c
void swap(Book_Entry* a, Book_Entry* b) {
    Book_Entry t = *a;
    *a = *b;
    *b = t;
}

int partition_entries(Book_Entry arr[], int low, int high) {
    Book_Entry* pivot = &arr[high]; // pivot
    int i = (low - 1);              // Index of smaller element

    for (int j = low; j <= high - 1; j++) {
        // If current element is smaller than or
        // equal to pivot
        if (arr[j].text.page < pivot->text.page) {
            i++; // increment index of smaller element
            swap(&arr[i], &arr[j]);
        }
        if (arr[j].text.page == pivot->text.page) {
            if (arr[j].text.offset < pivot->text.offset) {
                i++;
                swap(&arr[i], &arr[j]);
                if (arr[j].text.offset == pivot->text.offset) {
                    if (strlen(arr[j].heading.text) <= strlen(pivot->heading.text)) {
                        i++;
                        swap(&arr[i], &arr[j]);
                    }
                }
            }
        }
    }
    swap(&arr[i + 1], &arr[high]);
    return (i + 1);
}

/* The main function that implements QuickSort
   arr[] --> Array to be sorted,
   low   --> Starting index,
   high  --> Ending index */
void quickSort_entries(Book_Entry arr[], int low, int high) {
    if (low < high) {
        /* pi is partitioning index, arr[p] is now at right place */
        int pi = partition_entries(arr, low, high);

        // Separately sort elements before
        // partition and after partition
        quickSort_entries(arr, low, pi - 1);
        quickSort_entries(arr, pi + 1, high);
    }
}

int removeDuplicates_subbook(Book_Subbook* subbook) {
    int n = subbook->entry_count;
    Book_Entry* arr = subbook->entries;

    // Return, if array is empty
    // or contains a single element
    if (n == 0 || n == 1)
        return n;

    Book_Entry* temp = malloc(n * sizeof(Book_Entry));

    // Start traversing elements
    int j = 0;
    for (int i = 0; i < n - 1; i++)
        // If current element is not equal
        // to next element then store that
        // current element
        if ((arr[i].text.page != arr[i + 1].text.page) || (arr[i].text.offset != arr[i + 1].text.offset))
            temp[j++] = arr[i];

    // Store the last element as whether
    // it is unique or repeated, it hasn't
    // stored previously
    temp[j++] = arr[n - 1];

    // Modify original array
    for (int i = 0; i < j; i++)
        arr[i] = temp[i];

    subbook->entry_count = j;
    free(temp);
    return j;
}

static void subbook_undupe(Book_Subbook* subbook) {
    quickSort_entries(subbook->entries, 0, subbook->entry_count - 1);
    removeDuplicates_subbook(subbook);
}
```
It crashes on gakken though.
And it doesn't work as it should. I'm working on an updated version.
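For comparison, the sort-then-dedupe idea can also be sketched with the standard library's `qsort` instead of a hand-written quicksort. This is only an illustration under assumptions: `Position`, `Entry`, `compare_entries` and `sort_and_undupe` are hypothetical, simplified stand-ins and not the real `Book_Entry` / `Book_Subbook` definitions from book.c. The comparator puts the longest heading first within each group sharing a page and offset, so keeping the first entry of each group keeps the longest heading.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the structures used in book.c. */
typedef struct { int page; int offset; } Position;
typedef struct { const char* heading; Position text; } Entry;

/* Orders by page, then offset, then longest heading (in bytes) first. */
static int compare_entries(const void* lhs, const void* rhs) {
    const Entry* a = lhs;
    const Entry* b = rhs;
    if (a->text.page != b->text.page)
        return a->text.page < b->text.page ? -1 : 1;
    if (a->text.offset != b->text.offset)
        return a->text.offset < b->text.offset ? -1 : 1;
    size_t la = strlen(a->heading);
    size_t lb = strlen(b->heading);
    if (la != lb)
        return la > lb ? -1 : 1;
    return 0;
}

/* Sorts, then keeps only the first entry of each (page, offset) group.
   Returns the new entry count. */
static size_t sort_and_undupe(Entry* entries, size_t count) {
    assert(entries != NULL || count == 0);
    if (count < 2)
        return count;
    qsort(entries, count, sizeof(Entry), compare_entries);
    size_t kept = 1;
    for (size_t i = 1; i < count; ++i) {
        const Entry* last = &entries[kept - 1];
        if (entries[i].text.page != last->text.page ||
            entries[i].text.offset != last->text.offset) {
            entries[kept++] = entries[i];
        }
    }
    return kept;
}
```

Note that `strlen` counts bytes, not characters, so with UTF-8 headings "longest" means most bytes.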
I think the easiest fix is just to check lengths when looking for dupes. If there is a dupe with a longer header length, swap it with the current entry and delete the dupe. You shouldn't have to sort anything.
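A rough sketch of that suggestion, under assumptions: `Position`, `Entry`, `same_article` and `undupe_keep_longest` are hypothetical names with a simplified layout, not the actual zero-epwing structures, and the real dupe detection in book.c may work differently. It compares each entry against the ones already kept and, when a dupe has a longer heading, lets it take the kept entry's place.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the structures used in book.c. */
typedef struct { int page; int offset; } Position;
typedef struct {
    const char* heading; /* heading text, assumed NUL-terminated */
    Position    text;    /* location of the article body */
} Entry;

/* Non-zero when both entries point at the same article body. */
static int same_article(const Entry* a, const Entry* b) {
    return a->text.page == b->text.page && a->text.offset == b->text.offset;
}

/* Removes duplicates in place without sorting. Within each group of
   entries sharing a text position, the survivor is the one with the
   longest heading (in bytes). Returns the new entry count. */
static size_t undupe_keep_longest(Entry* entries, size_t count) {
    assert(entries != NULL || count == 0);
    size_t kept = 0;
    for (size_t i = 0; i < count; ++i) {
        /* Look for an already-kept entry pointing at the same article. */
        size_t k = 0;
        while (k < kept && !same_article(&entries[k], &entries[i]))
            ++k;
        if (k == kept) {
            entries[kept++] = entries[i]; /* first sighting of this article */
        } else if (strlen(entries[i].heading) > strlen(entries[k].heading)) {
            entries[k] = entries[i];      /* dupe with a longer heading wins */
        }
    }
    return kept;
}
```

This linear scan is quadratic in the number of entries, which may matter for large dictionaries; it is meant to show the length check, not to be fast.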
That being said, I'm not sure you actually want to use headers for anything. All of that information can be found in the entry text, and you are going to have to parse all of that stuff out with regex anyway. Honestly, if anything, this made me wonder if I should even be exporting the headers out of zero-epwing as AFAIK they are just some weird artifact of the EPWING format.
For reference articles you don't have a header in the entry text itself:
"heading": "¶両三日 <りょう2【両】>",
"text": "・両三日 two or three days; a couple of days\n"
I guess you really want to keep the info in the heading in that case. Take the example of 普通高等学校:
"heading": "¶普通高等学校 <こうとうがっこう【高等学校】>",
"text": "普通高等学校 a general [an ordinary, an academic] high school.\nこうとうかん【高等官】 {{w_46695}}(k{{n_41528}}t{{n_41528}}kan)\n"
The heading refers to 高等学校 while the text refers to 高等官. I think you want to keep the info in the heading.
Looking at your code to remove dupes, I don't see how you can get at the entry which you are comparing from a Page-pointer solely.
The heading of たしなむ is "heading": "嗜む", while it should be "たしなむ【嗜む】".