Nextomics / NextPolish

Fast and accurately polish the genome generated by long reads.
GNU General Public License v3.0
213 stars 28 forks source link

After the gap filling for scaffolds, using NextPolish. #136

Closed DayTimeMouse closed 2 months ago

DayTimeMouse commented 2 months ago

Hi,

After the gap filling for scaffolds(have N bases), I want to use NextPolish(v 1.4.1) to polish genome.

I have seen the Limitations and #98.

But I can't distinguish the part of 'N' has filled by nextpolish or without N bases as input for nextpolish. And I don't know how to link gaps back to the polished genome.

Could you give some ideas?

Best wishes.

moold commented 2 months ago

Here is the output of chatgpt when I asked “Write a Python script to split the scaffolds in a fasta file into contigs by N”,

import re
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

def split_scaffold_into_contigs(scaffold, n_threshold=1):
    """
    Splits a scaffold sequence into contigs based on N characters.

    Args:
    scaffold (str): Scaffold sequence.
    n_threshold (int): Minimum number of consecutive Ns to split by.

    Returns:
    contigs (list of str): A list of contigs split by Ns.
    """
    # Use regex to split by consecutive 'N's
    pattern = f"N{{{n_threshold},}}"  # Split by at least `n_threshold` Ns
    contigs = re.split(pattern, scaffold)

    # Remove empty contigs that may result from leading or trailing Ns
    contigs = [contig for contig in contigs if contig]

    return contigs

def process_fasta_file(input_fasta, output_fasta, n_threshold=1):
    """
    Processes a FASTA file by splitting scaffolds into contigs based on N characters.
    Writes the contigs to a new FASTA file.

    Args:
    input_fasta (str): Path to the input FASTA file.
    output_fasta (str): Path to the output FASTA file.
    n_threshold (int): Minimum number of consecutive Ns to split by.
    """
    contig_records = []

    for record in SeqIO.parse(input_fasta, "fasta"):
        # Get scaffold sequence
        scaffold_sequence = str(record.seq)

        # Split scaffold into contigs
        contigs = split_scaffold_into_contigs(scaffold_sequence, n_threshold)

        # Create a SeqRecord for each contig and add to list
        for i, contig in enumerate(contigs):
            contig_record = SeqRecord(Seq(contig), id=f"{record.id}_contig_{i+1}", description="")
            contig_records.append(contig_record)

    # Write all contig records to the output FASTA file
    with open(output_fasta, 'w') as output_handle:
        SeqIO.write(contig_records, output_handle, "fasta")

# Example usage
input_fasta_file = "input_scaffolds.fasta"
output_fasta_file = "output_contigs.fasta"
n_threshold = 1  # Minimum number of consecutive Ns to split by

# Process the input FASTA file
process_fasta_file(input_fasta_file, output_fasta_file, n_threshold)
DayTimeMouse commented 1 month ago

Hi moold,

I have already used no Ns scffolds.fa, and record Ns' positions and counts. However, polished assembly size is not equal to scffolds.fa(no Ns), so I can't restore Ns to scffolds.fa(no Ns), due to postions may change.

What should I do?

Thanks in advance.