MuiseDestiny / zotero-attanger

Attachment Manager for Zotero
GNU Affero General Public License v3.0
717 stars 13 forks

What does "Match attachment" do? #178

Closed ALonelySheep closed 1 month ago

ALonelySheep commented 1 month ago

Hello wonderful developer(s).

How does the "Match attachment" feature work under the hood? What is the matching policy? How does it differ from the "Attach new file" feature?

Servinjesus1 commented 1 month ago

The function in question is the following:

function matchAttachment

```ts
async function matchAttachment() {
  const items = ZoteroPane.getSelectedItems()
    .filter((i) => i.isTopLevelItem() && i.isRegularItem())
    .sort((a, b) => getPlainTitle(a).length - getPlainTitle(b).length);
  ztoolkit.log(
    "item titles: ",
    items.map((i) => i.getDisplayTitle()),
  );
  const sourceDir = await checkDir("sourceDir", "source path");
  if (!sourceDir) return;
  let files: OS.File.Entry[] = [];
  /* TODO: migrate to IOUtils */
  await Zotero.File.iterateDirectory(
    sourceDir,
    async function (child: OS.File.Entry) {
      if (!child.isDir && /\.(caj|pdf)$/i.test(child.name)) {
        files.push(child);
      }
    },
  );
  ztoolkit.log(
    "found pdf files:",
    files.map((f) => f.path),
  );
  const readPDFTitle = getPref("readPDFtitle") as string;
  ztoolkit.log("read PDF title: ", readPDFTitle);
  for (const item of items) {
    const itemtitle = getPlainTitle(item);
    ztoolkit.log("processing item: ", itemtitle);
    let iniDistance = Infinity;
    let matchedFile: OS.File.Entry | undefined = undefined;
    for (const file of files) {
      let filename = file.name.replace(/\..+?$/, "");
      /* Try to read the title from the PDF metadata or text */
      try {
        if (!/pdf/i.test(Zotero.File.getExtension(file.path))) {
          throw new Error("This is not a PDF file.");
        }
        ztoolkit.log("check file:", file.name + ": ");
        const data: any = await getPDFData(file.path);
        const lines: Array<{ fontSize: number; text: string }> = [];
        data.pages.forEach((page: Array<any>) => {
          page[page.length - 1][0][0][0][4].forEach(
            (line: Array<Array<Array<any>>>) => {
              const lineObj = { fontSize: 0, text: "" };
              line[0].forEach((word) => {
                lineObj.fontSize += word[4];
                lineObj.text +=
                  word[word.length - 1] + (word[5] > 0 ? " " : "");
              });
              lineObj.fontSize /= line[0].length;
              // ztoolkit.log(lineObj);
              lines.push(lineObj);
            },
          );
        });
        const optTitle =
          data?.metadata?.title ||
          data?.metadata?.Title ||
          lines
            .reduce(
              (max, cur) => {
                if (cur.fontSize > max.fontSize) {
                  return cur;
                } else if (cur.fontSize == max.fontSize) {
                  max.text += ` ${cur.text}`;
                }
                return max;
              },
              { fontSize: -Infinity, text: "" },
            )
            .text.replace(/\s?([\u4e00-\u9fff])\s?/g, "$1");
        ztoolkit.log("optical title: ", optTitle);
        if (
          readPDFTitle != "Never" &&
          optTitle &&
          (!/[\u4e00-\u9fff]/.test(itemtitle) || readPDFTitle == "Always")
        ) {
          filename = cleanLigature(optTitle);
        }
      } catch (e: any) {
        ztoolkit.log(e);
      }
      ztoolkit.log("filename:", filename);
      const distance = comparison.metricLcs.distance(
        itemtitle.toLowerCase(),
        filename.toLowerCase(),
      );
      ztoolkit.log(`【${itemtitle}】 × 【${filename}】 => ${distance}`);
      if (distance <= iniDistance) {
        iniDistance = distance;
        matchedFile = file;
      }
    }
    if (matchedFile) {
      ztoolkit.log("==>", itemtitle, matchedFile.path, iniDistance);
      const attItem = await Zotero.Attachments.importFromFile({
        file: matchedFile.path,
        libraryID: item.libraryID,
        parentItemID: item.id,
      });
      showAttachmentItem(attItem);
      if (!attItem.parentItemID) {
        Zotero.RecognizeDocument.autoRecognizeItems([attItem]);
      }
      removeFile(matchedFile.path);
      files = files.filter((file) => file !== matchedFile);
    }
  }
}
```

Let me Claude that for you:

This function, matchAttachment(), matches PDF or CAJ files from a source directory to the selected Zotero items based on their titles. Here's a breakdown of what it does:

  1. It takes the currently selected top-level, regular Zotero items and sorts them by title length, shortest first.

  2. It resolves the source directory through checkDir("sourceDir", "source path") and returns early if none is available.

  3. It scans the source directory for PDF and CAJ files.

  4. For each selected Zotero item, it tries to find a matching file in the source directory by comparing the item's title with each filename, after stripping everything from the first dot onward (so "paper.v2.pdf" becomes "paper").

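  5. Unless the readPDFtitle preference is "Never", it tries to read a title from each PDF: the metadata title if present, otherwise the text set in the largest font. That title replaces the filename for matching when the item title contains no Chinese characters or the preference is "Always".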

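  6. It computes a string distance (metric LCS) between the lowercased item title and the lowercased filename or PDF title, and keeps the file with the smallest distance. There is no threshold, so as long as at least one file remains, the closest one is chosen even if it is a poor match; ties go to the file seen last.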

  7. When a match is found, it imports the file as an attachment to the Zotero item.

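  8. If the imported attachment ends up without a parent item, it runs Zotero's automatic document recognition on it (Zotero.RecognizeDocument.autoRecognizeItems), probably to retrieve metadata. Since the file is imported with the item as its parent, this branch would normally be skipped.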

  9. Finally, it removes the matched file from the source directory (via removeFile) and from the in-memory list of candidates, so each file is attached to at most one item.
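
The matching policy in the steps above can be sketched in isolation. This is a minimal standalone illustration, not the plugin's code: `lcsLength`, `metricLcsDistance`, and `pickBestFile` are hypothetical names, and the distance is assumed to follow the usual metric-LCS definition, 1 - |LCS(a, b)| / max(|a|, |b|). The plugin's `comparison.metricLcs.distance` may differ in detail.

```typescript
// Length of the longest common subsequence of a and b (rolling-row dynamic programming).
function lcsLength(a: string, b: string): number {
  const row: number[] = new Array(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    let diag = 0; // value of the previous row at column j - 1
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // value of the previous row at column j
      row[j] = a[i - 1] === b[j - 1] ? diag + 1 : Math.max(row[j], row[j - 1]);
      diag = above;
    }
  }
  return row[b.length];
}

// Assumed definition: 0 for identical strings, 1 for strings with nothing in common.
function metricLcsDistance(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  if (longest === 0) return 0;
  return 1 - lcsLength(a, b) / longest;
}

// Mirrors the selection loop: smallest distance wins, `<=` lets later files win ties,
// and there is no cutoff, so any non-empty candidate list yields a match.
function pickBestFile(itemTitle: string, candidates: string[]): string | undefined {
  let best = Infinity;
  let matched: string | undefined = undefined;
  for (const name of candidates) {
    const d = metricLcsDistance(itemTitle.toLowerCase(), name.toLowerCase());
    if (d <= best) {
      best = d;
      matched = name;
    }
  }
  return matched;
}
```

Under these assumptions, `pickBestFile("deep learning", ["graph theory", "Deep Learning"])` returns `"Deep Learning"`, while a list containing only an unrelated name still returns that name, which is why it helps to keep the source directory limited to files you actually want attached.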

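The function includes several logging statements for debugging purposes. Errors raised while reading a PDF, including the deliberate one thrown for non-PDF files such as CAJ, are caught and logged, and matching then falls back to the filename. It also applies the user's preference for reading PDF titles and treats item titles containing Chinese characters differently.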
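
The PDF-title handling mentioned above can also be shown on its own. The sketch below assumes the text lines have already been extracted into `{ fontSize, text }` objects (in the plugin this comes from `getPDFData`); `Line`, `guessTitle`, and `shouldUsePdfTitle` are hypothetical names introduced for illustration, not functions from the plugin.

```typescript
interface Line {
  fontSize: number;
  text: string;
}

// Same character range the function tests for (CJK Unified Ideographs).
const CJK = /[\u4e00-\u9fff]/;

// Prefer the metadata title; otherwise join all lines set in the largest font
// and drop the spaces around Chinese characters.
function guessTitle(metadataTitle: string | undefined, lines: Line[]): string {
  if (metadataTitle) return metadataTitle;
  const largest = lines.reduce<Line>(
    (max, cur) => {
      if (cur.fontSize > max.fontSize) return { ...cur };
      if (cur.fontSize === max.fontSize) {
        return { fontSize: max.fontSize, text: `${max.text} ${cur.text}` };
      }
      return max;
    },
    { fontSize: -Infinity, text: "" },
  );
  return largest.text.replace(/\s?([\u4e00-\u9fff])\s?/g, "$1");
}

// The PDF title replaces the filename only if the preference is not "Never",
// a title was found, and either the item title has no Chinese characters
// or the preference is "Always".
function shouldUsePdfTitle(pref: string, itemTitle: string, pdfTitle: string): boolean {
  return pref !== "Never" && pdfTitle !== "" && (!CJK.test(itemTitle) || pref === "Always");
}
```

Unlike the original reduce, this version does not mutate the line objects, but it should produce the same title text. Note that the largest-font heuristic can pick up a journal banner or a drop cap rather than the real title, which is presumably why the preference can be set to "Never".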
