Help about non-reference sequence detection

Johnsonzcode commented 1 year ago

Dear @npch @prenderj @RenzoTale88 @prasundutta87 Your code shows us a clear way to constrcut a pangenome. When I came to detect non-reference sequence, I can't find a help documentation. I don't understand the parameters in CattleGraphGenomePaper/detectSequences/nf-GraphSeq/conf/params.config. Could you please write a help documemtation for us or explain the parameters for us ?

Thank you in adcance!

Best Johnson

RenzoTale88 commented 1 year ago

@Johnsonzcode again, without access to the data to try the code myself it is quite tricky to debug the code. I have tried to implement a quick fix, try to copy the newer version of 09C-DetectDuplicateContigs in the bin folder of the repository. Place it into the bin folder, make it executable with chmod a+x bin/09C-DetectDuplicateContigs and resume the analysis.

Johnsonzcode commented 1 year ago

But I can't see your email. It is been hiddened. The packed graph five is 9.1Gb. How could I send to you?

RenzoTale88 commented 1 year ago

Yeah, sorry I just noticed it (I had limited internet connection yesterday). My email address is andrea.talenti@ed.ac.uk and you can find it here, searching for Andrea Talenti on the email page of the University of Edinburgh. You can share it using services like WeTransfer, dropbox, onedrive or google drive. Anyway, if the alignment.blasttab is not empty, then the graph is now compliant with the analyses, and the problem lies with the R script, whose dependencies can be tricky to coordinate in an anaconda environment (especially if you have another local installation of R). In this case, you can just share the alignments.blasttab and contigs.txt and I can try to figure out what is going on.

Johnsonzcode commented 1 year ago

I filtered out small size scaffolds without collinearity to other genome's chromosome. So there is no contigs in my graph. I faked a empty contigs.txt. The alignment.blasttab has been send via email.

Johnsonzcode commented 1 year ago

alignment.blasttab is 32Mb. Email has been rejected. So I compress it with Bandzip. You should remove suffix .txt and decompress it. alignments.blasttab.7z.txt

RenzoTale88 commented 1 year ago

@Johnsonzcode I forgot the contigs.txt is a dummy file. I'll try to implement a fix and come back to you when ready. Thanks for the file.

Johnsonzcode commented 1 year ago

Thank you for your great help!

RenzoTale88 commented 1 year ago

@Johnsonzcode apologies I confused the inputs. Could you share with me the fai file present in the same folder?

Johnsonzcode commented 1 year ago

Sorry for late. Of cource. candidate.fa.fai.txt

Johnsonzcode commented 1 year ago

@RenzoTale88 Could I ask the situation?

RenzoTale88 commented 1 year ago

Working on it

RenzoTale88 commented 1 year ago

@Johnsonzcode I could not reproduce the bug you came across, but fix other to make the code compliant with recent versions of R. I've uploaded a new version of 09C-DetectDuplicateContigs in case you want to try it. However, keep in mind I did not come across the same issue, so I can't be sure I fixed this

Johnsonzcode commented 1 year ago

OK, I will try to find the issue by myself. I think it is R packages issue from the log.

RenzoTale88 commented 1 year ago

@Johnsonzcode if you have R installed in the same system you run nextflow, you can try matching the version of R in environment.yml with the one you have installed in your machine. I noticed sometimes using R in anaconda in nextflow causes some bizarre issues, though quite unpredictable and therefore difficult to fix.

Johnsonzcode commented 1 year ago

@Johnsonzcode I could not reproduce the bug you came across, but fix other to make the code compliant with recent versions of R. I've uploaded a new version of 09C-DetectDuplicateContigs in case you want to try it. However, keep in mind I did not come across the same issue, so I can't be sure I fixed this

Do you think it is the R packages version issue? If I try the pipeline by the new version of 09C-DetectDuplicateContigs, Is it still the error?

RenzoTale88 commented 1 year ago

Difficult to tell. I would recommend the new script because it introduces some new compatibility fixes. If the problem persists, I would try the R version solution.

From: johnsonz @.> Sent: Wednesday, January 11, 2023 10:35:21 AM To: evotools/CattleGraphGenomePaper @.> Cc: RenzoTale88 @.>; Mention @.> Subject: Re: [evotools/CattleGraphGenomePaper] Help about non-reference sequence detection (Issue #1)

@Johnsonzcodehttps://github.com/Johnsonzcode I could not reproduce the bug you came across, but fix other to make the code compliant with recent versions of R. I've uploaded a new version of 09C-DetectDuplicateContigs in case you want to try it. However, keep in mind I did not come across the same issue, so I can't be sure I fixed this

Do you think it is the R packages version issue? If I try the pipeline by the new version of 09C-DetectDuplicateContigs, Is it still the error?

— Reply to this email directly, view it on GitHubhttps://github.com/evotools/CattleGraphGenomePaper/issues/1#issuecomment-1378471930, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AFRTPKAI6SMCQNGBK3BIOSTWRZ5FTANCNFSM6AAAAAATHRV4AE. You are receiving this because you were mentioned.Message ID: @.***>

Johnsonzcode commented 1 year ago

Than you for your recommandation.

RenzoTale88 commented 1 year ago

@Johnsonzcode did you manage to run the workflow?

Johnsonzcode commented 1 year ago

@RenzoTale88, Sorry for long time no reply. I am trying to run. But the situation is like before. No progress.

Johnsonzcode commented 1 year ago

Dear @RenzoTale88 I am trying to add variants to the graph. After combining chromosome level graph with vg combine. One individual vcf file was selected to add for testing.

vg combine CHR*/rawCHR*.vg > merged.vg
vg add -v one_sample.vcf -t 32 > var_add_merged.vcf

But it turns out that contig breed1.CHR1 not in graph. I got a series of questions.

Should SVs in population be added into graph by sample or by breed?
How could I filter the SVs by sample or breed?
If by breed, how could I merge them, from your scripts SURVIVOR is prefer, but how to tune the setting?
Lastly, if vcf is merged. Is the vcf file should be added into graph chromosome by chromosome?

For short, maybe like this way:

### pesudo code
for breed in dir:
do
    SURVIVOR merge samples_in_breed.txt 100 1 1 0 0 50 > SV50_100BND.vcf
done
SURVIVOR merge all_breed_vcf.txt 100 1 1 0 0 50 > SV50_100BND_all_breed.vcf
### for chromosome vg: 
for i in seq(1 N); do
hal2vg --noAncestors --refGenome myref chr${i}.hal > tmp${i}.vg
vg mod -X 32 tmp${i}.vg > tmp${i}.chop32.vg
vg explode tmp${i}.chop32.vg exploded/
vg prune -r chr${i}.final.vg > chr${i}.final.pruned.vg
done
vg index -x all.xg {i}.final.vg ; done)
vg index -g all.gcsa (seq 1 N); do echo chr${i}.final.pruned.vg; done)

# then the SV50_100BND_all_breed.vcf should be split into SV50_100BND_all_breed_chr*.vcf by chromosome.
vcftools --vcf --chr * SV50_100BND_all_breed.vcf > SV50_100BND_all_breed_chr*.vcf

# then GraphVCF.py and AddToGraph.sh should be used chromosome by chromosome.

Am I right?

All the best Johnson

RenzoTale88 commented 1 year ago

@Johnsonzcode I'll try to run some more tests on the workflow whenever I get time. If I manage to reproduce the error you come across I'll come back to you. Regarding the use of VG I recommend asking the VG developer directly: they will be able to provide better support for your use case, especially on the most recent versions of VG. Last time I added variants to the graph was 23 releases ago!

Johnsonzcode commented 1 year ago

@RenzoTale88 Many thanks! Okay, I will ask for VG team. But is the workflow above right to add variants for the history release?

RenzoTale88 commented 1 year ago

If you want to extract SVs from a cactus graph you might as well try to use the minigraph-cactus pangenome pipeline, which is more designed for this. If you want to reproduce what we have done, have a look at the scripts here, starting with submit.sh to see the order we ran them.

Johnsonzcode commented 1 year ago

Dear @RenzoTale88 What I am tring to do is that, in addition to the six assemblies of chromosome level used to construct the graph-based pan-genome, I also want to add the structural variation at the population level to the graph-based pan-genome, as described in your article. From your scripts here. But no mention is made of how structural variation is added to the grpah-based pangenome. AddToGraph.sh seems to do that. I am not sure what these lines do:

graph=`head -n $SGE_TASK_ID $gv | tail -1`
if [ ! -z $vc ]; then
    vcf=`head -n $SGE_TASK_ID $vc | tail -1`
    spp=$sn
    if [ ! -e $vcf".tbi" ]; then tabix -p vcf $vcf; fi
fi
gName=`basename -s ".vg" $graph`
export TMPDIR=`pwd`

echo $vcf
echo $spp
echo $graph

if [ ! -z $vcf ] && [ ! -z $spp ] ; then 
    echo $spp
    echo "${vg}/vg add -n ${SGE_TASK_ID}=${spp}.${SGE_TASK_ID} -v $vcf $graph > GRAPH/${gName}.tmp.vg"
    vg mod -X 32 $graph | vg ids -s - | vg add -n ${SGE_TASK_ID}=${spp}.${SGE_TASK_ID} -v $vcf > GRAPH/${gName}.tmp.vg
    vg ids -s GRAPH/${gName}.tmp.vg > GRAPH/${gName}.final.vg && rm GRAPH/${gName}.tmp.vg
else
    vg mod -X 32 $graph | vg ids -s - > GRAPH/${gName}.final.vg
fi

Especially, ${SGE_TASK_ID}=${spp}.${SGE_TASK_ID}, is that says you are adding variants to graph-based by chromosome?

All the best Johnson

RenzoTale88 commented 1 year ago

In that particular script we add the variants from a vcf file using VG add on a chromosome by chromosome basis, providing the backbone genome as ${spp} and the chromosome number as the SGE variable SGE_TASK_ID. The option you mention is to link the chromosome to the right species, since the VCF is from a linear genome study. Again, you should really ask the VG team for help in this. We're not developing the tool, and are in no position to offer you guidance over how to perform your analyses, especially after 23 releases of the tools we used.

Johnsonzcode commented 1 year ago

@RenzoTale88 Thank you very much for your kind，You've helped me so much!

evotools / CattleGraphGenomePaper

Help about non-reference sequence detection #1