How can I remove surplus author entries with a regular expression?

sikouhjw commented 3 years ago

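Checklist

[x] I have searched the issues (including closed ones)

Build environment

Operating system
- [x] Windows 10
- [ ] Windows 8/8.1
- [ ] Windows 7
- [ ] An earlier version of Windows
- [ ] macOS
- [ ] Linux (please specify the distribution)
TeX distribution
- [x] TeX Live 2021
- [ ] MiKTeX
- [ ] CTeX suite 2.9.2.164
- [ ] An earlier version of the CTeX suite

Problem description

Background: my advisor is submitting a paper. The journal requires the plain bib style but also limits the word count of the reference list. There are too many references, so the surplus authors have to go (that is, every author after the third should be deleted).

Minimal working example (MWE)

The current bbl file: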

\bibitem{aaa}
O.~O. Abudayyeh, J.~S. Gootenberg, P.~Essletzbichler, S.~Han, J.~Joung, J.~J.
  Belanto, V.~Verdine, D.~B.~T. Cox, M.~J. Kellner, A.~Regev, E.~S. Lander,
  D.~F. Voytas, A.~Y. Ting, and F.~Zhang.
\newblock Rna targeting with crispr-cas13.
\newblock {\em Nature}, 550(7675):280--+, 2017.

Desired result:

\bibitem{aaa}
O.~O. Abudayyeh, J.~S. Gootenberg, P.~Essletzbichler.
\newblock Rna targeting with crispr-cas13.
\newblock {\em Nature}, 550(7675):280--+, 2017.

or

\bibitem{aaa}
O.~O. Abudayyeh, J.~S. Gootenberg, and P.~Essletzbichler.
\newblock Rna targeting with crispr-cas13.
\newblock {\em Nature}, 550(7675):280--+, 2017.

muyuuuu commented 3 years ago

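I have run into something similar here; apparently advisors' demands are the same all over the country.

In my case, though, the rule is: no cite, and citations are written by hand. That trims the content a little, and the overflow goes on GitHub as an extra appendix. I have no idea what they are thinking: no cross-references, every reference numbered by hand, and the page limit might as well not exist.

Back to the question: if it were me, I would write a Python script and simply delete the remaining authors. Whether LaTeX itself supports regex-style programming, I don't know.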

sikouhjw commented 3 years ago

I don't mean at the LaTeX level; I mean running the regex in a text editor.

If I were to use Python, how should I go about it? (facepalm)

stone-zeng commented 3 years ago

First grab the lines between \bibitem and \newblock, join them, split the result on commas, take the first three items, and substitute them back.
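
The steps above can be sketched in Python roughly as follows. This is only an illustration, not code from the thread: the function name keep_first_authors and the ENTRY pattern are made up here. It assumes every entry looks like the plain-style sample in the question (a \bibitem{key} line, then author lines, then at least one \newblock at the start of a line) and that commas appear only between authors.

```python
import re

# Hypothetical pattern: one \bibitem{key} line, then the author text,
# stopping just before the first \newblock that starts a line.
ENTRY = re.compile(r"(\\bibitem\{[^}]*\}[ \t]*\n)(.*?)(?=^\\newblock)", re.S | re.M)


def keep_first_authors(bbl_text, keep=3):
    """Keep only the first `keep` comma-separated authors of every entry."""

    def shorten(match):
        head, authors = match.group(1), match.group(2)
        # Collapse line breaks and indentation into single spaces.
        joined = " ".join(authors.split())
        parts = [p.strip() for p in joined.split(",")]
        if len(parts) <= keep:
            # Short author list: nothing to cut.
            return head + joined + "\n"
        return head + ", ".join(parts[:keep]) + ".\n"

    return ENTRY.sub(shorten, bbl_text)
```

Reading main.bbl into a string, passing it through keep_first_authors, and writing the result to a new file would apply this to a whole bibliography. Entries whose author text splits into three or fewer pieces keep all their authors.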

muyuuuu commented 3 years ago

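data = []
# Author text collected for the current entry; None means we are not inside an author block
string = None

with open('test.bbl', 'r') as f:
    for line in f:
        if 'bibitem' in line:
            # A new entry starts: keep the line and begin collecting authors
            string = ""
            data.append(line)
        elif 'newblock' in line:
            # The first newblock ends the author block; for later ones string is already None
            if string is not None:
                parts = string.split(',')
                tmp = ', '.join(item.strip() for item in parts[:3])
                # Restore the final period if the cut removed it
                if tmp and not tmp.endswith('.'):
                    tmp += '.'
                data.append(tmp + '\n')
            string = None
            data.append(line)
        elif string is not None:
            # An author line: drop the line break and append it, separated by a space
            string += line.strip() + ' '
        else:
            # Anything else (blank lines, the thebibliography wrapper) is copied unchanged
            data.append(line)

with open('result.bbl', 'w') as f:
    f.writelines(data)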

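A few notes on the code:

If the line is a bibitem line, append it and reset string to empty, ready to collect the authors that follow.
If the line is a newblock line: the first newblock after a bibitem triggers the author processing, which is the split that stone-zeng described; by the second newblock the authors have already been handled, so nothing more is done to them. I tell the first occurrence from the second by checking for None.
Otherwise, a line that is neither bibitem nor newblock is treated as an author line: strip the trailing line break and append it to string.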
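I did not add a special check for entries with fewer than three authors; I trust you can add one if needed. (Slicing with [:3] does not fail on a shorter list, so such entries should pass through with their authors unchanged.)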

muyuuuu commented 3 years ago

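That said, what stone-zeng described looks simpler. The logic:

On a bibitem line, start calling next(lines) on an iterator over the lines and keep going until a newblock line is reached.
Store the lines returned by those next(lines) calls, split them on commas, and put the shortened result back in their place.

This isn't a programming discussion group, though, so I'll leave it at that.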
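
A minimal sketch of the next()-based variant described above, assuming the same plain-style layout as the sample in the question. The generator name shorten_entries is hypothetical; it wraps the lines in an iterator so that next() can pull author lines until the first \newblock.

```python
def shorten_entries(lines, keep=3):
    """Yield the lines of a .bbl file with each author block cut to `keep` names."""
    it = iter(lines)
    for line in it:
        yield line
        if not line.startswith("\\bibitem"):
            continue
        # Pull author lines with next() until the first \newblock shows up.
        authors = []
        nxt = next(it, None)
        while nxt is not None and not nxt.startswith("\\newblock"):
            authors.append(nxt.strip())
            nxt = next(it, None)
        parts = " ".join(authors).split(",")
        text = ",".join(parts[:keep]).strip()
        if len(parts) > keep:
            # The original final period was cut away, so put one back.
            text += "."
        if text:
            yield text + "\n"
        if nxt is not None:
            # Hand back the \newblock line that ended the author block.
            yield nxt
```

Because it is a generator, the result can be passed straight to writelines on the output file; a list of lines or an open file object both work as input.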

zepinglee commented 3 years ago

When you drop the authors beyond the third, you really should add et al.
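
A sketch of how the truncation step could append et al., assuming the author text of one entry has already been joined into a single string. The helper name truncate_with_et_al is hypothetical, and the et~al. spelling is an assumption modelled on how the plain style renders "and others"; a journal may ask for a different form.

```python
def truncate_with_et_al(authors, keep=3, suffix="et~al."):
    """Cut a comma-separated author string to `keep` names and append `suffix`."""
    parts = [p.strip() for p in authors.split(",")]
    if len(parts) <= keep:
        # Nothing is dropped, so no "et al." is needed.
        return authors.strip()
    return ", ".join(parts[:keep]) + ", " + suffix
```

The suffix already ends with a period, so no extra period needs to be written after it.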

syvshc commented 3 years ago

import re

def findSubstring(string,substring,times):
    current = 0
    for i in range(1,times+1):
        current = string.find(substring,current+1)
        if current == -1 :  return -1

    return current

f = open('./main.bbl', 'r')
t = open('./target.bbl', 'w')
# print(f.read())
out = ""
bibitem_num = 0
newblock_num = 0
for line in f:
    tmp = line
    if re.search("^\\\\bibitem", tmp) is not None:
        newblock_num = 0
    if re.search("^\\\\newblock", tmp) is not None:
        newblock_num =  1 
    if re.search("^\\\\", tmp) is None and newblock_num == 0:
        out = out + line.strip()
    elif out == "":
        t.write(tmp)
    else:
        num = findSubstring(out, ",", 3)
        out = out[:num]
        t.write(out + "\n" + tmp)
        out = ""
        newblock_num = 0

# Close both files so that everything is flushed to target.bbl
f.close()
t.close()

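The file to be modified is main.bbl and the result is written to target.bbl. Please test first whether it does the job; I don't use Python much for this sort of thing. If it works, I'll add some comments once I'm back from a meal.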

sikouhjw commented 3 years ago

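The only problem with this code is that the author list no longer ends with a period. Change line 31, the t.write call, to

        t.write(out + '.' + "\n" + tmp)

and it works. Thanks!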

syvshc commented 3 years ago

...
    if re.search("^\\\\", tmp) is None and newblock_num == 0:
        out = out + line.strip()
    elif out == "":
        t.write(tmp)
    else:
        num = findSubstring(out, ",", 3)
        out = out[:num]
        t.write(out + "\n" + tmp)
        out = ""
        newblock_num = 0

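This part does not account for author lines that begin with \. For example, if a line inside the author block begins with \&, it falls straight into the else branch. I have therefore revised the script and added comments: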

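import re

# Return the index of the times-th occurrence of substring in string, or -1 if there are fewer
def findSubstring(string, substring, times):
    current = -1
    for i in range(times):
        current = string.find(substring, current + 1)
        if current == -1:
            return -1
    return current

# Accumulates the author text, which may span several lines
out = ""
# Marks where collecting authors into out stops. It starts as True so that lines
# before the first \bibitem (such as \begin{thebibliography}) are copied unchanged
is_newblock = True

with open('./main.bbl', 'r') as main, open('./target.bbl', 'w') as target:
    for line in main:
        tmp = line
        # A new \bibitem starts: set is_newblock to False, write the \bibitem line, and go to the next line
        if re.search("^\\\\bibitem", tmp) is not None:
            is_newblock = False
            target.write(tmp)
            continue
        # A \newblock has been reached: set is_newblock to True
        if re.search("^\\\\newblock", tmp) is not None:
            is_newblock = True
        # No \newblock seen yet, so this line belongs to the author block
        if not is_newblock:
            # Strip leading and trailing whitespace and append to out, separated by a space
            out = (out + " " + line.strip()).strip()
        # out is empty: nothing has been collected since it was last cleared,
        # so we are at the second or a later \newblock (or outside any entry); write the line as-is
        elif out == "":
            target.write(tmp)
        # out has content: the previous line was an author line and this one is not,
        # so this is the first \newblock and the authors in out can be processed and written
        else:
            # Find the third comma; if there is one, drop everything from it onward and end with a period
            num = findSubstring(out, ",", 3)
            if num != -1:
                out = out[:num] + "."
            target.write(out + "\n" + tmp)
            out = ""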
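
The backslash issue mentioned above can be checked in isolation. In this small sketch (the pattern and function names are illustrative and do not appear in the scripts in this thread), the older test treats any line starting with \ as a command, while the stricter test only recognises \bibitem and \newblock.

```python
import re

# Older test: any leading backslash counts as a command line.
STARTS_WITH_BACKSLASH = re.compile(r"^\\")
# Stricter test: only the two commands that delimit the author block.
STARTS_WITH_COMMAND = re.compile(r"^\\(?:bibitem|newblock)\b")


def looks_like_author_old(line):
    """True if the older test would treat the line as author text."""
    return STARTS_WITH_BACKSLASH.search(line) is None


def looks_like_author_new(line):
    """True if the stricter test would treat the line as author text."""
    return STARTS_WITH_COMMAND.search(line) is None
```

With an author line such as "\& Co., J.~Doe," the older test rejects the line and the stricter one accepts it, which is the difference the revised script relies on.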

CTeX-org / forum