HaveIBeenPwned / EmailAddressExtractor

A project to rapidly extract all email addresses from any files in a given path
BSD 3-Clause "New" or "Revised" License
64 stars 23 forks source link

A basic implementation of a .ODT reader #44

Closed jfbourke closed 1 year ago

jfbourke commented 1 year ago

The OpenDocumentTextReader will extract content.xml from the .ODT, parse the file and then write the text into a temporary file which is the passed into PlainTextReader for further parsing for email addresses.

GStefanowich commented 1 year ago

Since you made PlainTextReader.DisposeAsync() virtual, could you add to the method?

GC.SuppressFinalize(this);

public async ValueTask DisposeAsync()
{
    GC.SuppressFinalize(this);
}
jfbourke commented 1 year ago

Updated PlainTextReader.DisposeAsync() to clr suppress finalizer.

GStefanowich commented 1 year ago

@jfbourke Thanks for implementing this, I don't want to do this myself so that others still get their chance to contribute :smile:

Is there a reason you went with writing to a temporary file, which is then read by the PlainTextReader?

Where you do

private static string ExtractContent(string zipPath)
{
    using (ZipArchive archive = ZipFile.OpenRead(zipPath))
    {
        foreach (ZipArchiveEntry entry in archive.Entries)
        {
            if (entry.FullName.Equals("content.xml", StringComparison.OrdinalIgnoreCase))
            {
                using(var writer = new StreamWriter(DestinationPath, true, Encoding.UTF8, 4096))
                {
                    using(var reader = new XmlTextReader(entry.Open()))
                    {
                        while(reader.Read())
                        {
                            if (reader.NodeType == XmlNodeType.Text ||
                                reader.NodeType == XmlNodeType.CDATA)
                            {
                                writer.WriteLine(reader.ReadContentAsString());
                            }
                        }
                    }
                }

                return DestinationPath;
            }
        }
    }

    throw new Exception($"Unable to load content.xml from '{zipPath}'");
}

Would it not also be doable to just implement ILineReader and return reader.ReadContentAsString()?

public async IAsyncEnumerable<string?> ReadLineAsync([EnumeratorCancellation] CancellationToken cancellation = default)
{
    using (ZipArchive archive = ZipFile.OpenRead(this.ZipPath))
    {
        foreach (ZipArchiveEntry entry in archive.Entries)
        {
            if (entry.FullName.Equals("content.xml", StringComparison.OrdinalIgnoreCase))
            {
                using(var reader = new XmlTextReader(entry.Open()))
                {
                    while(await reader.ReadAsync())
                    {
                        if (reader.NodeType == XmlNodeType.Text ||
                            reader.NodeType == XmlNodeType.CDATA)
                        {
                            yield return await reader.ReadContentAsStringAsync();
                        }
                    }
                }

                yield break;
            }
        }
    }

    throw new Exception($"Unable to load content.xml from '{this.ZipPath}'");
}

I don't see where the overhead of reading, writing, and reading again is worth it

jfbourke commented 1 year ago

@GStefanowich agree, will refactor it now.