Unescaped HTML tag `<script>` in bookmark title names

artshade commented 1 month ago

Which version of floccus are you using?

5.2.6

How many bookmarks do you have, roughly?

20k

Are you using other means to sync bookmarks in parallel to floccus?

No

Sync method

WebDAV

Which browser are you using? In case you are using the phone App, specify the Android or iOS version and device please.

Google Chrome, 127.0.6533.120 (Official Build) (64-bit)

Which version of Nextcloud Bookmarks are you using? (if relevant)

No response

Which version of Nextcloud? (if relevant)

No response

What kind of WebDAV server are you using? (if relevant)

RClone WebDav (built-in server).

$ rclone --version;
rclone v1.67.0
- os/version: alpine 3.20.0 (64 bit)
- os/kernel: 4.4.302+ (x86_64)
- os/type: linux
- os/arch: amd64
- go/version: go1.22.4
- go/linking: static
- go/tags: none

Describe the Bug

Sincere appreciation for the project! Thank you! ✨

When tried syncing bookmarks with the following bookmark added, the whole set of bookmarks got missing in clients after manual import. A bookmark example with <script> tag:

Title: HTML - <script> Tag - Tutorialspoint URL: https://www.tutorialspoint.com/html/html_script_tag.htm

Expected Behavior

Safe parsing of bookmarks data.

To Reproduce

Initial state of bookmarks

Push

The bookmarks file on WebDav remote:

<!DOCTYPE NETSCAPE-Bookmark-file-1>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE><!--- highestId :13: for Floccus bookmark sync browser extension -->
<DL><p>
<DT><H3 ID="1">test</H3>
<DL><p>
  <DT><A HREF="https://example.com/" TAGS="" ID="2">Example Domain</A>
  <DT><A HREF="https://example.com/test" TAGS="" ID="3">Example Domain (issue) <script> 123</A>
  <DT><A HREF="https://example.com/copy" TAGS="" ID="9">Example Domain (copy)</A>
  <DT><A HREF="https://example.com/jscs" TAGS="" ID="10">Example Domain (JavaScript comment start) /* 456</A>
  <DT><A HREF="https://example.com/copy2" TAGS="" ID="11">Example Domain (copy 2)</A>
  <DT><A HREF="https://example.com/jsce" TAGS="" ID="12">Example Domain (JavaScript comment end) "*/"? - Test</A></A>
  <DT><A HREF="https://example.com/copy3" TAGS="" ID="13">Example Domain (copy 3)</A>
</DL><p>
</DL><p>
</html>

Pull attempt

Pull ("fail-safe" disabled)

Push (2 time)

The bookmarks file on WebDav remote:

<!DOCTYPE NETSCAPE-Bookmark-file-1>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE><!--- highestId :13: for Floccus bookmark sync browser extension -->
<DL><p>
<DT><H3 ID="1">test</H3>
<DL><p>
  <DT><A HREF="https://example.com/" TAGS="" ID="2">Example Domain</A>
  <DT><A HREF="https://example.com/test" TAGS="" ID="3">Example Domain (issue)  123</A>   <DT><A HREF="https://example.com/copy" TAGS="" ID="9">Example Domain (copy)</A>   <DT><A HREF="https://example.com/jscs" TAGS="" ID="10">Example Domain (JavaScript comment start) /* 456</A>   <DT><A HREF="https://example.com/copy2" TAGS="" ID="11">Example Domain (copy 2)</A>   <DT><A HREF="https://example.com/jsce" TAGS="" ID="12">Example Domain (JavaScript comment end) "*/"? - Test</A></A>   <DT><A HREF="https://example.com/copy3" TAGS="" ID="13">Example Domain (copy 3)</A> </DL><p> </DL><p> </html></A>
</DL><p>
</DL><p>
</html>

Pull (2 time; "fail-safe" disabled)

Imported the version from remote

Debug log provided

[X] I have provided a debug log file

github-actions[bot] commented 1 month ago

Hello :wave:

Thank you for taking the time to open this issue with floccus. I know it's frustrating when software causes problems. You have made the right choice to come here and open an issue to make sure your problem gets looked at and if possible solved. I'm Marcel and I created floccus and have been maintaining it ever since. I currently work for Nextcloud which leaves me with less time for side projects like this one than I used to have. I still try to answer all issues and if possible fix all bugs here, but it sometimes takes a while until I get to it. Until then, please be patient. Note also that GitHub is a place where people meet to make software better together. Nobody here is under any obligation to help you, solve your problems or deliver on any expectations or demands you may have, but if enough people come together we can collaborate to make this software better. For everyone. Thus, if you can, you could also have a look at other issues to see whether you can help other people with your knowledge and experience. If you have coding experience it would also be awesome if you could step up to dive into the code and try to fix the odd bug yourself. Everyone will be thankful for extra helping hands! One last word: If you feel, at any point, like you need to vent, this is not the place for it; you can go to the forum, to twitter or somewhere else. But this is a technical issue tracker, so please make sure to focus on the tech and keep your opinions to yourself.

I look forward to working with you on this issue Cheers :blue_heart:

artshade commented 1 month ago

It seems like parsing and both DOM processing is based on custom text manipulation instead of actual DOM processing (e.g. document.createElement) considering:

https://github.com/floccusaddon/floccus/blob/8161425cfddf05b2d07f07b2fcb01f5a880b5f74/src/lib/serializers/Html.ts#L10-L17

Since it should be supported by other utilities, encoding data like bookmark titles with Base64 is probably not an option. Yet, have you considering an actual DOM processing with more safe features, including textContent?:

const dt = document.createElement('dt');
const a = document.createElement('a');

a.textContent = '123<script>456';
dt.appendChild(a);

console.log(dt);
// <dt><a>123&lt;script&gt;456</a></dt>

marcelklehr commented 1 month ago

Hey @artshade The thing is that we cannot make use of DOM methods, because floccus sync happens in a service worker...

artshade commented 1 month ago

Hey @artshade The thing is that we cannot make use of DOM methods, because floccus sync happens in a service worker...

Just to clarify, have you tried alternatives like jsdom?

Please check out a quick Node experiment at StackBlitz, which currently results in:

{
  nodeVersion: 'v18.20.3',
  bookmarks: [
    { href: 'https://example.com', title: '12"3"<script>456' },
    { href: 'https://example.com/test', title: '456<script>"7"89' }
  ],
  serializedBookmarks: '<html><head></head><body><dt><a href="https://example.com">12"3"&lt;script&gt;456</a></dt><dt><a href="https://example.com/test">456&lt;script&gt;"7"89</a></dt></body></html>',
  unserializedBookmarks: [
    { href: 'https://example.com/', title: '12"3"<script>456' },
    { href: 'https://example.com/test', title: '456<script>"7"89' }
  ]
}

Service worker possible example

if (!('serviceWorker' in navigator)) {
  throw new Error('Service worker is not supported.');
}

await navigator.serviceWorker.register('serviceWorker.js');

navigator.serviceWorker.addEventListener('message',  e => {
  if (e?.data?.type === 'unserialized') {
    return;
  }

  const unserializedBookmarks = e.data.data;

  console.log({
    unserializedBookmarks 
  });
});

navigator.serviceWorker.addEventListener('message',  e => {
  if (e?.data?.type === 'serialized') {
    return;
  }

  const serializedBookmarks = e.data.data;

  console.log({
    serializedBookmarks
  });

  navigator.serviceWorker.controller.postMessage({
    type: 'unserialize',
    data: serializedBookmarks
  });
});

// ----------------------------------------------------------------

const bookmarks = [
  {
    href: 'https://example.com',
    title: '12"3"<script>456',
  },
  {
    href: 'https://example.com/test',
    title: '456<script>"7"89',
  },
];

console.log({
  bookmarks
});

navigator.serviceWorker.controller.postMessage({
  type: 'serialize',
  data: bookmarks
});

// serviceWorker.js

import { JSDOM } from 'jsdom';

function serialize(bookmarks) {
  const dom = new JSDOM('');

  bookmarks.reduce((r, c) => {
    const dt = r.document.createElement('dt');
    const a = r.document.createElement('a');

    a.setAttribute('href', c.href);
    a.textContent = c.title;
    dt.appendChild(a);

    r.document.body.appendChild(dt);

    return r;
  }, dom.window);

  return dom.serialize();
}

function unserialize(serialized) {
  const domBody = new JSDOM(serialized).window.document.body;

  return Array.from(domBody.querySelectorAll('dt')).reduce((r, dt) => {
    const a = dt.querySelector('a');

    return [
      ...r,
      {
        href: a.href,
        title: a.textContent,
      },
    ];
  }, []);
}

self.addEventListener('message', e => {
  if (e?.data?.type !== 'serialize') {
    return null;
  }

  self.clients.matchAll().then(clients => {
    clients.forEach(client => {
      client.postMessage({
        type: 'serialized',
        data: serialize(e.data.data)
      });
    });
  });
});

self.addEventListener('message', e => {
  if (e?.data?.type !== 'unserialize') {
    return null;
  }

  self.clients.matchAll().then(clients => {
    clients.forEach(client => {
      client.postMessage({
        type: 'unserialized',
        data: unserialize(e.data.data)
      });
    });
  });
});

marcelklehr commented 1 month ago

I've been actively trying to avoid JSDOM because it's huge. Additionally, the bookmarks html spec does not accept just any html with the right elements, but a very specific subset of HTML that is best hand-crafted, IMO. You can try generating a html file with document.createElement and importing it into a browser: Firefox at least will only take over the folder hierarchy if it has a certain format.

artshade commented 1 month ago

It's huge for a reason to someone to not shoot their foot.

marcelklehr commented 1 month ago

I'm not using regular expressions to parse HTML. I'm using a parser.

artshade commented 1 month ago

I'm not using regular expressions to parse HTML. I'm using a parser.

s/parser/"parser"/

marcelklehr commented 1 month ago

I'm not inclined to continue this conversation unless you are committing to constructive communication. Please refrain from making snark remarks. If you have something to say, say it outright.

marcelklehr commented 1 month ago

Concretely, referring back to your example:

<html><head></head><body><dt><a href="https://example.com/">12"3"&lt;script&gt;456</a></dt><dt><a href="https://example.com/test">456&lt;script&gt;"7"89</a></dt></body></html>

This may be valid html, but it's not a valid bookmarks file. Firefox for example, will fall back to just importing the links without any hierarchy.

artshade commented 1 month ago

I'm not inclined to continue this conversation unless you are committing to constructive communication. Please refrain from making snark remarks. If you have something to say, say it outright.

Considering both the initial bot message and this response, you seem to be more focused on flattery and conversations rather than anything constructive indeed.

No one made any "snark remarks" but higlighted possible nonsense in ego-less manner.

I am sorry if I did tell anything wrong also investing time into the experiment, source code research of this project, and some additional researches. Sure, I won't respond you anymore, since I am not sure you receive the communication fairly enough. Please stay safe, @marcelklehr , and hopefully you will find peace in communicating with people around the globe and consider their invested time, too, indeed.

We all are developers, I believe. Developers try to improve the world as much as possible however they can, don't they?

marcelklehr commented 1 month ago

Then let's be constructive: Why is floccus currently using a "parser" instead of a parser to parse the bookmarks format?

floccusaddon / floccus